Project Goal:
Develop a framework for analyzing historical sales of Swire and competitor products to develop a machine learning model that predicts and prescribes the best products to launch, expected demand, best time of year, region, and/or optimal price.
Business Problem:
Swire Coca-Cola operates extensively, producing, selling, and distributing beverages across 13 states in the American West. The company is known for its regular introduction of new, often limited-time, products aimed at stimulating market demand. However, predicting accurate demand for these innovative products remains a significant challenge. Historical data provides some insight but is insufficient for precise forecasting due to variations in regional and demographic appeal.
The firm stands at the forefront of the Western US beverage distribution sector, continually launching limited-edition products to maintain consumer interest and market dominance. Yet the uncertainty in demand forecasting for these unique items presents risks of either overproduction or shortages, each carrying potential financial and reputational impacts. The project aims to leverage historical sales data to enhance the accuracy of demand predictions for future innovative offerings. In essence, can the data science team do a better job of predicting demand for new innovation products than a traditional market research survey or the gut instinct of a seasoned sales manager?
Ultimately, the following questions should be answered for any products worth pursuing:
Product 1: Which 13 weeks of the year would this product perform best in the market? What is the forecasted demand, in weeks, for those 13 weeks?
Product 2: Swire plans to release this product 2 weeks prior to Easter and 2 weeks post Easter. What will the forecasted demand be, in weeks, for this product?
Product 3: Which 13 weeks of the year would this product perform best in the market? What is the forecasted demand, in weeks, for those 13 weeks?
Product 4: Swire plans to release this product for the duration of 1 year but only in the Northern region. What will the forecasted demand be, in weeks, for this product?
Product 5: Swire plans to release this product for 13 weeks, but only in one region. Which region would it perform best in?
Product 6: Swire plans to release this product for 6 months. What will the forecasted demand be, in weeks, for this product?
Product 7: Swire plans to release this product in the Southern region for 13 weeks. What will the forecasted demand be, in weeks, for this product?
Analytic Problems:
The analytics problem at hand involves developing a robust framework for analyzing historical sales data of Swire and competitor products. The primary objective is to construct a predictive machine learning model that forecasts sales, supplemented by models as simple as multiple linear regression for prescribing the optimal products to launch. This entails estimating expected demand, determining the best timing for launch, and recommending pricing strategies. A significant challenge lies in managing the variability in release durations of both historical and future products, which span from 13 weeks to six months. As these are innovation products, many key features simply do not exist in the historical data, adding layers of complexity to the modeling problem.
Purpose of this Notebook:
The purpose of this notebook is first to provide a high-level overview of the ML process and results for the Swire Capstone project. The project is focused on forecasting sales for new products and for existing products with new flavors. The question of which potential innovation items to model is broken down into 7 key sections. The goal is to provide a high-level overview of the results from the multiple linear regressions that served as the basis for the products recommended for further modeling using XGBoost. The results from the XGBoost models will be used to provide further recommendations on which products to launch based on expected demand and pricing. By the end of this notebook, we will have a clear understanding of the potential products to launch and the expected sales for each, while also answering the key questions posed by Swire for each innovation.
Given the data size (24M rows), where each row represents a transaction between a manufacturer of soft drinks and an unknown vendor in 13 western US states, the term “Data Mining” could not be more descriptive of the task at hand. While each row looks similar to the others, each real-world sales transaction captured within each observation is very different from its neighbors. Preparing the data required far more thought than an initial glance would suggest. We realized that the Swire data set was hundreds (maybe thousands) of data sets combined and aggregated into one massive 2D frame. With that, our first step was ensuring that the data was whole, devoid of errors and blanks. This was typically the case, and we only had to impute 0.2% of missing values, identified by running simple text analytics on the data frame, to ensure it would work with any ML algorithm. The data set provided was very well constructed and generally error-free.
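The whole-data audit described above can be sketched as follows. This is a minimal illustration on toy data: the column names mirror the Swire schema, but the values and the modal-value imputation are assumptions for the sketch, not the exact text analytics we ran.

```r
# Toy frame with the kinds of gaps we audited (values are illustrative).
df <- data.frame(
  BRAND  = c("A", "A", NA, "B", "A"),
  FLAVOR = c("CHERRY", "", "CHERRY", "LIME", "CHERRY"),
  stringsAsFactors = FALSE
)

# Treat empty strings as missing, then report the share of NAs per column.
df[!is.na(df) & df == ""] <- NA
na_share <- colMeans(is.na(df))
print(round(na_share, 3))

# Impute character columns with their most frequent (modal) value.
impute_mode <- function(x) {
  x[is.na(x)] <- names(which.max(table(x)))
  x
}
df[] <- lapply(df, impute_mode)
stopifnot(!anyNA(df))  # frame is now whole and ML-ready
```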
Scaling the data down to a manageable number of observations proved to be the first task, and it was conducted differently for each question. The outliers we observed when sampling the data were “natural” outliers, and they exist throughout the data. We observed instances where distributors held firm competitive leads in different geographic areas (Market Keys) and, at times, made huge sales, captured in some rows, that threw off the average size of most transactions. These outliers, however, were not errors; they were simply a natural effect of doing business, necessary for moving inventory and generating profit. In several instances, we observed single cases of soft drinks sold at 100x below the average price, as if a freebie or incentive was offered to move the product. In other cases, many rows reflected huge inventory sold at standard prices, with a 100x dollar sales value given the size of the deal. We observed, across subsections of soft drinks, that the Kansas market was an outlier in and of itself: 3 unique zip codes in 3 unique market keys sold thousands of products beyond the figures of much more significant market areas in heavily populated places such as Seattle, Washington. None of these interpretations could be understood without sampling the data along clearly segmented lines (by flavor, brand, manufacturer). None of the outliers in any sub-dataset proved to be erroneous, and we discarded no outliers, to preserve the accuracy of our modeling.
A tedious element of data preparation was understanding the Market Keys and creating appropriate bins to capture them adequately. If market keys spanned multiple US states, we binned them into a general geographic region such as “NORTHERN” or “MOUNTAIN.” If other market keys sat solidly inside a specific state and metro area, we assigned them to a more specific region, such as “COLORADO” or “NEW MEXICO.” The diversity of each market key was profound, with some market keys spanning huge rural areas and a few towns, and other market keys representing equally sized populations in three or fewer zip codes. There is far more to explore regarding market keys and geographic regions, including incorporating US Census data. Given the scope of the task, we did not incorporate census data into our calculations; however, we see it as a valuable addition for future, more specific targeting of products to local areas.
The ITEM column encapsulated most of the data contained in other columns of the data frame. We spent much time culling information from each ITEM that was also captured in other variables within observations. As an example, “PACKAGE” was formally captured in a column within the entire dataset, but it was also represented in the ITEM column; so too, “diet,” “enhanced water,” and “energy” were correctly flagged in each observation and also noted within the ITEM string. Handling and removing redundancy proved essential, as the number of distinct ITEMS shrinks dramatically when these redundant elements are strategically excised from the ITEM column. By cleaning up the ITEM column, we reduced what seemed like an infinite number of products of a unique flavor down to a small handful of manageable ITEMS that could then be easily manipulated in an ML model. The PACKAGE column was of much interest, and ultimately we broke down each PACKAGE description (separated by spaces) into a one-hot-encoded element for each observation. Doing so made it much easier to compare products and use our models to divine value from the variety of packaging that exists in combination. The importance of the information in the PACKAGE column cannot be overstated. Moreover, we caught ourselves saying that “Swire and competitors are mostly packaging and delivery companies.” Swire, as a client, proved keenly interested in “packaging” in its questions. Through our data preparation and subsequent feature engineering, each string and substring in any given row has a fuzzy albeit real numeric value and proved critical to our modeling; the key was removing redundant information in each row.
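The space-separated tokenization of PACKAGE can be sketched as below; the package strings here are illustrative stand-ins, not actual Swire values.

```r
# Split each PACKAGE description on spaces and one-hot encode every token.
pkg <- c("2L MULTI JUG", "12SMALL 24ONE CUP", "2L MULTI CUP")

parts  <- strsplit(pkg, " ")
tokens <- sort(unique(unlist(parts)))

# One 0/1 column per token: does the description contain that token?
onehot <- sapply(tokens, function(t)
  vapply(parts, function(p) as.integer(t %in% p), integer(1)))
onehot <- data.frame(onehot, PACKAGE = pkg, check.names = FALSE)
print(onehot)
```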
Our modeling process went from simple and fun to complex and technical. First, we used elementary multiple linear regression models natively built into R to dive in and understand each of the seven tasks. Each team member utilized a standardized recipe to prepare data and then ran a GLM on three subsections of the data to answer each question. First, we sampled 10% of the data and filtered it by “BRAND” related to the innovation flavor. We also filtered by “flavor” to observe competitor products, estimate demand, and identify the coefficients of the product properties that drove demand for that particular flavor. We made data frames of individual flavors and brands to compare and contrast, both to see similarities and to observe initial anecdotes of the market space where a specific innovation flavor could quickly appear. In some instances, it was clear that Swire concocted ideas that were easy wins. In other cases, a quick review of descriptive analytics and the output of a GLM suggested some ideas would likely fail. From this initial process, we focused on three specific innovation products worthy of deeper analysis. A note on our use of linear regression models: we used them only to gain immediate insight into our products. We did not use train/test splits to run predictive models to reinforce our decision making, but the output of these models gave us enough initial information to make simple decisions based on the client questions put forth.
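A condensed version of that per-question GLM pass looks like the sketch below. The data are simulated here (column names mirror the Swire schema; the values and coefficients are not real results), but the sample-filter-regress flow is the one described above.

```r
set.seed(123)
n <- 1000
df <- data.frame(
  FLAVOR          = sample(c("CHERRY", "LIME"), n, replace = TRUE),
  CALORIC_SEGMENT = sample(0:1, n, replace = TRUE),
  SEASON          = sample(c("WINTER", "SPRING", "SUMMER", "FALL"), n, replace = TRUE),
  UNIT_SALES      = rpois(n, 100)
)
df$DOLLAR_SALES <- df$UNIT_SALES * 3 + rnorm(n, 0, 20)  # simulated relationship

# 10% sample, then filter to the flavor related to the innovation concept.
sampled   <- df[sample(nrow(df), nrow(df) %/% 10), ]
flavor_df <- sampled[sampled$FLAVOR == "CHERRY", ]

# GLM to read off which product properties drive dollar sales.
fit <- glm(DOLLAR_SALES ~ UNIT_SALES + CALORIC_SEGMENT + SEASON,
           data = flavor_df, family = gaussian())
print(summary(fit)$coefficients)
```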
We entertained some naive ideas, assuming that the data would lead us to answer complicated questions such as estimating demand for a product that never existed. We thought it possible (or easy) to create data for innovation products to be utilized as secondary test data to divine unit sales. We went so far as to generate synthetic data for extrapolating unit sales from elements that did not exist but were drawn from assumptions in the data. Given the disparity between existing model assumptions trained and tested on actual historical data and the distinct characteristics of innovation product scenarios, we soon faced almost insurmountable technical problems. This forced us to return to basics and lean on time-tested ML processes to become even more knowledgeable about our products and domain. This was a hard-earned lesson.
Our team relied on eXtreme Gradient Boosting (XGBOOST) due to its highly efficient and versatile implementation of gradient-boosted decision trees designed for speed and performance. Given its precision in making predictions, we decided to use this method, taking the assumptions from our initial linear models to the next level. The best way forward for us was to know, with as much accuracy as possible, which elements of products were more highly favored or influential than others in creating demand. We used XGBOOST to break down and evaluate the assumptions we made about each product with a high level of accuracy. Knowing the elements of the innovative products, based on the significant factors that sell competitively similar products, would be the primary way to answer client questions without resorting to even more sophisticated methods to divine near-impossible answers. We engineered our datasets into one-hot-encoded, manageable, more comprehensive tables and ran them through our XGBOOST models. If the task at hand were a classic prediction problem, we would feel very optimistic about the quality of our models' predictive power. Using this predictive power in a different way, we drew upon the accuracy and output of our XGBOOST models to bolster our assumptions and solutions in order to provide Swire with very solid answers. To be sure our models could answer questions, we split all our data into train/test sets and employed 5-fold cross-validation to ensure we were not contaminating data. The results of the XGBOOST models almost always reinforced the initial information gleaned from the GLM models, and then added better information with high-quality predictive power to substantiate our claims.
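The contamination-avoidance setup can be sketched in base R as follows; the row count is a stand-in, and the fitted models themselves (which used the xgboost package) are omitted.

```r
set.seed(42)
n <- 500                       # stand-in for the engineered table's row count
idx <- sample(n)               # shuffle row indices

train_idx <- idx[1:floor(0.8 * n)]        # 80% training rows
test_idx  <- idx[(floor(0.8 * n) + 1):n]  # 20% held-out test rows

# Assign each training row to one of 5 folds for cross-validation.
folds <- sample(rep(1:5, length.out = length(train_idx)))

stopifnot(length(intersect(train_idx, test_idx)) == 0)  # no leakage
print(table(folds))  # 5 folds of equal size
```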
We employed the XGBOOST model on innovation products that had more in common with similar historic products. We did not employ XGBOOST on innovation products whose properties, upon initial review, were more disparate from the historical products they could potentially be grouped among. That is not to say we underestimate the effectiveness of a totally new and unique package or flavor, but given that we can only baseline customer expectations on past behavior, our objective was to identify innovation products that showed more consistency with neighboring products.
Again, if this were a classic prediction problem, where we had to estimate the unit sales of a string of transactions, we could almost guarantee excellent results. Multiple iterations of our XGBOOST models on training and testing data consistently returned R-squared and adjusted R-squared values in the high .90s. Our models demonstrated remarkable predictive accuracy, achieving R-squared and adjusted R-squared values of around 0.95 in many instances, indicating that they successfully captured most of the variance in our engineered dataset. While there are slight variations in MAE, RMSE, and MAPE between the training and testing phases, these discrepancies highlight the model’s nuanced understanding of the data rather than significant shortcomings. In this case, we are highly confident in the claims, suggestions, and recommendations we make about the historic (real) products relative to innovation products for the benefit of Swire. That said, we owe almost all of our model performance to UNIT_SALES being very highly correlated with DOLLAR_SALES - these two variables are obviously very powerful together. When we remove DOLLAR_SALES from our models, the R-squared drops to just over .50 for training and testing. However, what we are left with very effectively identifies the additional properties of soft drink sales that are independent of sales price.
We admit that the quality of our model doesn’t answer specific questions directly; there is a great deal of interpretation and insight we must provide to the client in order to hit home runs. We also attribute its predictive power to the stability and reliability of the Swire dataset. So what we can say is that the data is good, even if it has been aggregated together and some of the everyday elements of sales transactions and the consumption habits of consumers are long lost. Our models also explain only part of the final recommendations we will provide to the client, Swire. The modeling that follows essentially gives us the minimal context to begin to argue for our recommendations.
A note on DEMAND: Across this effort, “demand” is conceptualized not as a direct measure of consumer interest, but as a quantity that varies significantly across different products and categories. The use of “unit sales” as a proxy for demand is inadequate due to the presence of aggregated sales agreements between manufacturers and retailers, which can distort actual demand metrics. Therefore, the determination of demand will be approached through alternative prescriptive analytics methodologies, separate from the current machine learning framework, to address its multifaceted nature effectively.
It is obvious that Swire’s data science team, our client, hopes to get ahead of its marketing department’s costly and laborious market surveys and product-testing efforts related to innovation products. Predicting consumer interest in a product cannot be done in a vacuum. However, we, like our clients at Swire, are believers in data science and machine learning, and we believe that, like a well-crafted algorithm, we can get very good at predicting what may be successful according to historic data, even if that data is an amalgamation of predecessor products or based on elements of past success.
For our presentation to Swire, we will present a clear evaluation of 2 or 3 of the questions asked of us. We will draw on the power of the following effort and lessons learned to create a demand forecast from pricing optimization strategies (not illustrated in this notebook). What we present will be drawn from conclusions made in this notebook from the exploration of innovation ideas and historical neighbors.
Ultimately, our goal is to create an innovation data set: something that represents and resembles the essential elements of a successful historical product without violating best practices in data science and ML modeling. We are also cautious about making statements that we cannot back up with data and the results of our models. Once we have that stable footing, our team will use prescriptive analytics to make suggestions for our client that can get well ahead of market surveys and divining consumer interest through taste tests at state fairs. Whether we can create that data set remains to be seen, but we are confident we are much of the way there, and, after this modeling exercise, we can answer most of our client’s questions without making up statistics on the spot.
We have provided the following modeling in several sections, each focusing on one of the seven potential Swire innovation products. These aim to provide the high-level initial results from multiple linear regressions that served as the basis for the products recommended for further modeling using XGBoost. Summaries of each follow. Some comments are provided in individual sections when necessary; most comments are captured inside the R code output. Most code has been suppressed, mainly because it produces lengthy output. We have generally included the exploration of each proposed item using a GLM, mostly for our own edification and education. From there we chose the three products we felt had, initially, the most promise to pursue further. We rallied around those three products, using XGBOOST models to get deep into their properties and provide our assessments and recommendations.
Analysis: There are two innovation aspects involved: a new package for Diet Smash and the addition of an existing flavor (Plum) that has never been part of the Diet Smash lineup of drinks. The linear regressions indicate high potential for forecasting, especially the small Plum multiple linear regressions. Diet Smash shows significance in terms of seasonality, particularly in Summer and Winter. However, Plum by itself is on the edge of not being significant for Summer and Winter. Packaging alone is not as strong for Plum and Diet Smash. Diet Smash comes in two regular types with one size that ran for four weeks. The regression on the innovation data frame suggests PACKAGE12SMALL 24ONE CUP as the configuration that sold the most amongst its standard products. Considering the analysis, we do not recommend further pursuit of Diet Smash and Plum compared to other opportunities. While there is potential for forecasting, the significance of Plum in specific seasons is not entirely clear. Additionally, the proposed innovative packaging does not seem to provide a strong advantage for the product. Further analysis may be needed to determine the viability of launching this product, particularly in terms of seasonal performance and packaging effectiveness.
MODEL Employed: Multiple Linear Regression
RECOMMENDATION: This innovation product concept has more disparate properties than (has less in common with) historical products. As a result, it may prove more difficult to predict its sales or to model it as well as other innovation concepts in this notebook effort.
Based on this brand, Sparkling Jacceptabletlester does not traditionally sell well across all variables. It appears to do better as REGULAR as opposed to DIET. Its 1.25L MULTI JUG, PACKAGE20SMALL MULTI JUG, PACKAGE2L MULTI JUG, and PACKAGE1L MULTI JUG sell better than most other packaging. Only some of its smaller packaging (PACKAGE12SMALL 6ONE CUP, PACKAGE12SMALL 8ONE SHADYES JUG, PACKAGE7.5SMALL 8ONE CUP) are okay sellers, but not as strong as the larger sizes. SparklingJ does not do particularly well in April of any given year. Looking at season and month, we see a winter uptick; it is not a hot summer seller. SparklingJ has a very small window of UNIT_SALES (<3000) to DOLLAR_SALES (<5000). It does better in the Colorado and Northern regions. Avocado is a popular flavor, representing more than 7% of all flavored soft drinks in the market (~1.8M sales observations). Avocado flavored drinks do not fare as well as SSD (Standard Soda Pop).
MODEL Employed: Multiple Linear Regression
RECOMMENDATION: Based on the client questions asked, this innovation product may be the least similar to its would-be competitors already in the market. People would find it a little out of place. The sales of SparklingJ have never been as strong as others. Although the sales of Avocado flavors are solid, this innovation product would possibly stand out from others, and potentially not in a good way. This product would likely need some market research, something we are hoping to avoid.
The dataset for Kiwano and energy drinks is limited, with very few rows. Despite this limitation, there is still potential to develop a model that can predict launch sales. There is an argument for creating a model that predicts the unit sales of energy drinks, specifically with a size of 16 and kiwano flavor. This model could potentially be used in conjunction with the current sales rate of VENOMUS BLAST launches to provide an accurate forecast. Determining the optimal weeks for selling the product is a challenge. The historical best 13 weeks of sales of VENOMUS BLAST, energy drinks, or kiwano flavored drinks could be used as a reference for identifying potential sales periods. Given the limited data available for Kiwano and energy drinks, it is essential to approach the recommendation cautiously. Additionally, determining the best weeks for selling the product relies heavily on historical sales data, which may not accurately reflect future market conditions.
In exploring and modeling for Kiwano, we found a subset of the data that proves to be explanatory and has some potential for prediction. In our modeling, a subset of the data was selected by filtering products that matched the energy category, items containing “Kiwano,” and sizes containing “16.” Then, for our specific questions of when to launch and how much demand to expect, we added week of the year and weeks since launch. With this data set, we created an XGBoost model that showed some predictive power. Finally, we created a dummy set of data and had our model create predictions for every combination of our features to help predict future sales.
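The filtering and dummy-grid steps above can be sketched as below; the rows here are illustrative stand-ins, and the XGBoost fit that would score the grid is omitted.

```r
# Toy rows shaped like the Swire data (item text is illustrative).
df <- data.frame(
  CATEGORY = c("ENERGY", "ENERGY", "SSD"),
  ITEM     = c("VENOMUS BLAST KIWANO 16SMALL", "COLA 12SMALL", "KIWANO POP 16SMALL"),
  stringsAsFactors = FALSE
)

# Filter: energy category, "KIWANO" in the item text, and a size containing "16".
kiwano <- subset(df, CATEGORY == "ENERGY" &
                     grepl("KIWANO", ITEM) &
                     grepl("16", ITEM))
print(kiwano)

# Dummy prediction grid: every combination of the launch-timing features,
# to be scored by the fitted model for week-by-week forecasts.
grid <- expand.grid(WEEK_OF_YEAR = 1:52, WEEKS_SINCE_LAUNCH = 1:13)
print(nrow(grid))  # 676 feature combinations
```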
MODELS Employed: Multiple Linear Regression, XGBOOST
RECOMMENDATION: Overall, the unique qualities of Kiwano created a path where we were able to make some predictions on what window of time would be best to launch and to take a first step at unit forecasting by week. We recommend this flavor sensation for additional study before going to market.
We do not think we should recommend this product. We have very few observations of our Square brand, and only 19 of those reside in the Northern Region. When widening the scope to look at other factors such as the flavor, category, and region as a whole, it continues not to make sense to sell this product for a whole year in the Northern Region. The Northern Region consistently demonstrated a negative relationship with total dollar sales in comparison with the other regions we reviewed. The Square brand has been primarily sold in California and the Southwest so far, and we don’t have sufficient data to say that this brand would do better than the other sparkling waters that have not done well in the Northern Region. From a time perspective, it also does not make sense to sell this for the whole year: the spring, and especially the fall, do far better than the winter in all of the regression models we ran. We would not want to recommend selling it during months where it would not perform well. From a flavor and category perspective, we were able to gain more insights, such as the best type of packaging associated with the flavor, but the Northern Region continued to perform poorly with the mulberries flavor.
MODEL Employed: Multiple Linear Regression
RECOMMENDATION: We would not recommend moving forward with this product due to the small amount of historical data that we have with the brand and flavor and the poor performance of the region with similar products.
“GREETINGLE Woodsy Yellow” appears to be a good opportunity for further pursuit based on available data. There are only two brands selling three Woodsy Yellow flavored products. The client asked, “Swire plans to release this product for 13 weeks, but only in one region. Where would it perform best?” We unequivocally state that the Kansas market (East Kansas) is an excellent choice. And while this may be due to skewed sales represented in the East Kansas region, it would also do well in the Northern Region (East, NE Idaho, Oregon, and Washington). The packaging as proposed would not be a stretch to conceive and sell. Greetingle as a brand has seen relatively good dollar sales even with smaller volumes of unit sales. And it could use some company: there is really only one product on the market worth pursuing as a benchmark for success, and “TITANS NOURISH WATER BEVERAGE WOODSY YELLOW” presents as the leading competitor. These factors led us to pursue deeper analysis of Greetingle Woodsy Yellow as a good idea for a small-batch, limited-market product.
MODELS Employed: Multiple Linear Regression, XGBOOST
RECOMMENDATION: Recommend further study for a limited release product to select markets.
“DIET MOONLIT” demonstrates decent sales performance, ranking 69th in total revenue out of 288. This indicates a moderate level of success in the market for Diet Moonlit alone. Our modeling of soft drinks with Cassava, particularly coupled with the Energy category, returned good coefficients and modeling results, suggesting there is good potential for this flavor variant, particularly in a highly caffeinated situation. The analysis identifies some 21-week stretches that appear promising for making six-month predictions; the challenge lies in selecting which weeks to use for these predictions. Despite Cassava not being considered a top-tier flavor option, the analysis suggests potential for “DIET MOONLIT” given its decent sales performance and the good results when we run both GLM and XGBOOST models on the Cassava “innovation” data frame. While Diet Moonlit has 2L Multi JUG success, there is no combination using that package format for Cassava and Energy. As Swire has success in the 2L Multi JUG arena for Diet Moonlit, there is potential for a successful product even if Cassava and Energy are not historically found in that package format. Cassava and Energy are historically found only in the regular caloric segment; however, combining the high predictability of Cassava Energy and the high sales of Diet Moonlit, there is potential for a successful product. With identified promising weeks for predictions, there is an opportunity to leverage this information to further enhance sales forecasting and strategic decision-making.
MODELS Employed: Multiple Linear Regression, XGBOOST
RECOMMENDATION: Based on the provided information, we recommend continuing work toward bringing Diet Moonlit Cassava Energy to market.
Through our analysis of a ‘Pink Woodsy’ flavored launch, there was very little evidence that further modeling would create a reliable prediction, as the historical data is missing many of the features we would need to explain variation. A few of the elements missing for this specific innovative product are: A. Comparable flavors. Though there have been products in the past with Pink or Woodsy, there have never been any items with this combination. B. Innovation data for the brand ‘Peppy.’ In our research of the brand, we found it has no innovation data that would give us indications of how a new product would compete if launched. C. A definition of which regions or areas would be considered ‘South’ for this launch.
MODEL Employed: Multiple Linear Regression
RECOMMENDATION: With these crucial factors either excluded from modeling or approximated by best estimates from the ‘closest’ items, we do not believe moving forward with a prediction would be advisable. For a product such as this, any trial data or direction on which items are most comparable would help assure accuracy.
3 of the 7 potential innovation products were selected and recommended for further modeling. The XGBoost models were run on these products, and the results are summarized below. The four other models are not included, for brevity, as our methods and processes are generally shown in the linear regression review sections below.
Item Description: Greetingle Health Beverage Woodsy Yellow .5L 12One Jug
a. Caloric Segment: Regular
b. Market Category: ING Enhanced Water
c. Manufacturer: Swire-CC
d. Brand: Greetingle
e. Package Type: .5L 12One Jug
f. Flavor: ‘Woodsy Yellow’
Q. Swire plans to release this product for 13 weeks, but only in one region. Which region would it perform best in?
df <- readRDS("swire_no_nas.rds") #load in EDA cleaned data (no nas)
#fixes the market keys to regional areas...
regions_joinme <- read.csv("states_summary.csv")
unique(regions_joinme$REGION)
## [1] "NORTHERN" "DESERT_SW" "PRAIRIE" "CALI_NEVADA" "MOUNTAIN"
## [6] "SOCAL" "ARIZONA" "NEWMEXICO" "NOCAL" "COLORADO"
## [11] "KANSAS"
str(regions_joinme)
## 'data.frame': 200 obs. of 2 variables:
## $ MARKET_KEY: int 13 70 179 197 272 352 32 33 44 50 ...
## $ REGION : chr "NORTHERN" "NORTHERN" "DESERT_SW" "DESERT_SW" ...
# Perform a left join using the merge() function
df <- merge(df, regions_joinme[, c("MARKET_KEY", "REGION")], by = "MARKET_KEY", all.x = TRUE)
rm(regions_joinme)
# Update CALORIC_SEGMENT values: 0 if 'DIET/LIGHT', otherwise 1
df$CALORIC_SEGMENT <- ifelse(df$CALORIC_SEGMENT == "DIET/LIGHT", 0, 1)
df$MARKET_KEY <- as.character(df$MARKET_KEY)
df <- df %>%
mutate(
MONTH = as.numeric(substr(DATE, 6, 7)), # Extract the month from YYYY-MM-DD format
SEASON = case_when(
MONTH %in% c(12, 01, 02) ~ "WINTER",
MONTH %in% c(03, 04, 05) ~ "SPRING",
MONTH %in% c(06, 07, 08) ~ "SUMMER",
MONTH %in% c(09, 10, 11) ~ "FALL",
TRUE ~ NA_character_ # This is just in case there are any undefined values
)
)
# Making a 10% sample of the data to shrink it
set.seed(123) # Set a random seed for reproducibility
df <- df[sample(1:nrow(df), 2446143), ] # keep the name df so downstream chunks work unchanged
#skim(df)
summary(df)
## MARKET_KEY DATE CALORIC_SEGMENT CATEGORY
## Length:2446143 Length:2446143 Min. :0.0000 Length:2446143
## Class :character Class :character 1st Qu.:0.0000 Class :character
## Mode :character Mode :character Median :1.0000 Mode :character
## Mean :0.5025
## 3rd Qu.:1.0000
## Max. :1.0000
## UNIT_SALES DOLLAR_SALES MANUFACTURER BRAND
## Min. : 0.04 Min. : 0.0 Length:2446143 Length:2446143
## 1st Qu.: 11.00 1st Qu.: 36.5 Class :character Class :character
## Median : 40.00 Median : 135.1 Mode :character Mode :character
## Mean : 173.43 Mean : 587.4
## 3rd Qu.: 126.00 3rd Qu.: 427.4
## Max. :91778.00 Max. :409159.3
## PACKAGE ITEM REGION MONTH
## Length:2446143 Length:2446143 Length:2446143 Min. : 1.000
## Class :character Class :character Class :character 1st Qu.: 3.000
## Mode :character Mode :character Mode :character Median : 6.000
## Mean : 6.283
## 3rd Qu.: 9.000
## Max. :12.000
## SEASON
## Length:2446143
## Class :character
## Mode :character
##
##
##
# Create a scatter plot with the regression line, colored by MANUFACTURER
# Gives a good overview of the data sampled
ggplot(df, aes(x = UNIT_SALES, y = DOLLAR_SALES, color = MANUFACTURER)) +
geom_point(alpha = 0.5) + # Adjust alpha to avoid overplotting, if necessary
geom_smooth(method = "lm", color = "black", se = FALSE) + # Add linear regression line
labs(title = "Linear Model of UNIT_SALES vs. DOLLAR_SALES by MANUFACTURER",
x = "UNIT SALES",
y = "DOLLAR SALES") +
theme_minimal() +
theme(legend.position = "bottom") # Adjust legend position if needed
## `geom_smooth()` using formula = 'y ~ x'
# create a table of total values by brand
brand_summary <- df %>%
group_by(BRAND) %>%
summarise(
total_units_sold = sum(UNIT_SALES),
total_revenue = sum(DOLLAR_SALES),
avg_price = total_revenue / total_units_sold,
total_days_sold = n() # Count the number of rows for each brand
) %>%
arrange(desc(total_revenue)) %>% # Order by revenue in descending order
mutate(rank = row_number())
summary(brand_summary)
## BRAND total_units_sold total_revenue avg_price
## Length:288 Min. : 1 Min. : 1 Min. : 0.5315
## Class :character 1st Qu.: 2310 1st Qu.: 7563 1st Qu.: 2.0861
## Mode :character Median : 94691 Median : 266075 Median : 3.0291
## Mean : 1473003 Mean : 4989427 Mean : 3.2661
## 3rd Qu.: 651385 3rd Qu.: 2161764 3rd Qu.: 3.7252
## Max. :40414038 Max. :159387186 Max. :42.9378
## total_days_sold rank
## Min. : 1.0 Min. : 1.00
## 1st Qu.: 121.8 1st Qu.: 72.75
## Median : 1988.0 Median :144.50
## Mean : 8493.5 Mean :144.50
## 3rd Qu.: 8075.8 3rd Qu.:216.25
## Max. :124603.0 Max. :288.00
print(brand_summary[brand_summary$BRAND == "GREETINGLE", ])
## # A tibble: 1 × 6
## BRAND total_units_sold total_revenue avg_price total_days_sold rank
## <chr> <dbl> <dbl> <dbl> <int> <int>
## 1 GREETINGLE 2170595. 5740637. 2.64 48787 47
GREETINGLE is a good brand, ranking 47th out of 288 brands in terms of total revenue, with an average price of $2.64, slightly below the overall mean of $3.27.
# Filter the dataframe for only 'GREETINGLE'
filtered_df <- df %>%
filter(BRAND == "GREETINGLE")
# Create the plot
ggplot(filtered_df, aes(x = UNIT_SALES, y = DOLLAR_SALES)) +
geom_point(color = "red", alpha = 1) + # Bright red points with full opacity
geom_smooth(method = "lm", color = "black", se = FALSE) + # Add linear regression line without confidence band
labs(title = "Linear Model of UNIT_SALES vs. DOLLAR_SALES for GREETINGLE",
x = "UNIT SALES",
y = "DOLLAR SALES") +
theme_minimal() +
theme(legend.position = "none")
## `geom_smooth()` using formula = 'y ~ x'
GREETINGLE has a wide body of groupings; one unusually steep grouping shows higher revenue at lower unit volumes, most likely a higher-priced package or deals made outside the normal pricing structure.
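One way to probe that steep grouping is to compute the implied price per unit and pull out the unusually expensive rows. A minimal sketch, assuming `filtered_df` from the chunk below and an arbitrary $5-per-unit cutoff (the threshold is an assumption, not derived from the data):

```r
# Probe the steep grouping: rows whose implied price per unit is unusually high
# (the $5/unit threshold is an assumption, chosen only for illustration)
filtered_df %>%
  mutate(implied_price = DOLLAR_SALES / UNIT_SALES) %>%
  filter(implied_price > 5) %>%
  summarise(n_rows = n(),
            median_price = median(implied_price),
            total_dollars = sum(DOLLAR_SALES))
```

If those rows are few but carry large dollar totals, that supports the "deals outside the norm" reading.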
# Sales by Week of the Year
filtered_df %>%
mutate(DATE = as.Date(DATE)) %>%
mutate(WEEK = as.integer(format(DATE, "%U"))) %>%
group_by(WEEK) %>%
summarise(total_sales = sum(UNIT_SALES)) %>%
ggplot(aes(x = WEEK, y = total_sales)) +
geom_line(color = "black") + # Line connecting weekly totals
labs(title = "Total Sales by Week of the Year",
x = "Week of the Year",
y = "Total Unit Sales") +
theme_minimal()
> GREETINGLE sells solidly from early spring (week 10) through mid-summer (week 30), then slows through the latter part of the year.
library(zoo)
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
# Calculate total sales for each rolling group of 21 consecutive weeks (about 5 months)
sales_by_group <- filtered_df %>%
mutate(DATE = as.Date(DATE)) %>%
mutate(WEEK = as.integer(format(DATE, "%U"))) %>%
group_by(WEEK) %>%
summarise(total_sales = sum(UNIT_SALES)) %>%
mutate(sales_in_group = rollsum(total_sales, 21, align = "left", fill = NA)) %>%
mutate(week_label = paste0("Week ", WEEK + 1, " to Week ", WEEK + 21)) %>%
arrange(WEEK) %>% # Order by WEEK
filter(!is.na(sales_in_group)) # Remove rows with sales_in_group = NA
# Plot the bar chart
sales_by_group$week_label <- factor(sales_by_group$week_label, levels = sales_by_group$week_label[order(sales_by_group$WEEK)])
ggplot(sales_by_group, aes(x = factor(week_label), y = sales_in_group)) +
geom_bar(stat = "identity", fill = "black") +
labs(title = "Total Sales for Each 21-Week Grouping",
x = "Weeks (Starting from Week 1)",
y = "Total Sales") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
> GREETINGLE shows a similar pattern across weekly groupings: solid from early spring through mid-summer, then a slow decline to the end of the year.
#find the best 21 weeks for Woodsy sales - WOODSY YELLOW has a space in the middle
# Calculate total sales for each group of 21 consecutive weeks
sales_by_woodsy <- df %>%
filter(str_detect(ITEM, "WOODSY YELLOW")) %>%
mutate(DATE = as.Date(DATE)) %>%
mutate(WEEK = as.integer(format(DATE, "%U"))) %>%
group_by(WEEK) %>%
summarise(total_sales = sum(UNIT_SALES)) %>%
mutate(sales_in_group = rollsum(total_sales, 21, align = "left", fill = NA)) %>%
mutate(week_label = paste0("Week ", WEEK + 1, " to Week ", WEEK + 21)) %>%
arrange(WEEK) %>% # Order by WEEK
filter(!is.na(sales_in_group)) # Remove rows with sales_in_group = NA
# Plot the bar chart
sales_by_woodsy$week_label <- factor(sales_by_woodsy$week_label, levels = sales_by_woodsy$week_label[order(sales_by_woodsy$WEEK)])
ggplot(sales_by_woodsy, aes(x = factor(week_label), y = sales_in_group)) +
geom_bar(stat = "identity", fill = "black") +
labs(title = "Total Sales for Each 21-Week Grouping",
x = "Weeks (Starting from Week 1)",
y = "Total Sales") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
> Woodsy Yellow sales are best in the 21 weeks from week 14 to 34.
#find the best 13 weeks for ING Enhanced Water, WOODSY YELLOW
# Calculate total sales for each rolling group of 13 consecutive weeks
sales_by_innovation <- df %>%
filter(CATEGORY == "ING ENHANCED WATER",
str_detect(ITEM, "WOODSY"),
str_detect(ITEM, "YELLOW")) %>%
mutate(DATE = as.Date(DATE)) %>%
mutate(WEEK = as.integer(format(DATE, "%U"))) %>%
group_by(WEEK) %>%
summarise(total_sales = sum(UNIT_SALES)) %>%
mutate(sales_in_group = rollsum(total_sales, 13, align = "left", fill = NA)) %>%
mutate(week_label = paste0("Week ", WEEK + 1, " to Week ", WEEK + 13)) %>%
arrange(WEEK) %>% # Order by WEEK
filter(!is.na(sales_in_group)) # Remove rows with sales_in_group = NA
# Plot the bar chart
sales_by_innovation$week_label <- factor(sales_by_innovation$week_label, levels = sales_by_innovation$week_label[order(sales_by_innovation$WEEK)])
ggplot(sales_by_innovation, aes(x = factor(week_label), y = sales_in_group)) +
geom_bar(stat = "identity", fill = "black") +
labs(title = "Total Sales for Each 13-Week Grouping",
x = "Weeks (Starting from Week 1)",
y = "Total Sales") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Combining the views above again shows that most of the year is hot: sales taper slowly toward year end, pick back up in early spring, and hold solid through mid-summer.
# Making a new smaller "innovation" data frame
GREETINGLE_DF <- filtered_df
#create innovation based on ING ENHANCED WATER
innovation<- df %>%
filter(CATEGORY == "ING ENHANCED WATER",
str_detect(ITEM, "WOODSY"),
str_detect(ITEM, "YELLOW"))
#unique PACKAGE string from innovation
print(unique(innovation$PACKAGE))
## [1] ".5L 6ONE JUG" "20SMALL MULTI JUG" "26-32SMALL MLT"
## [4] ".5L MULTI JUG"
library(dplyr)
library(lubridate)
innovation <- innovation %>%
mutate(
MONTH = month(ymd(DATE)), # Extract month using lubridate's ymd function
MONTH = as.factor(MONTH) # Convert the extracted month into a factor
)
str(innovation)
## 'data.frame': 9642 obs. of 13 variables:
## $ MARKET_KEY : chr "969" "137" "1135" "924" ...
## $ DATE : chr "2022-05-14" "2021-01-09" "2022-04-16" "2021-07-03" ...
## $ CALORIC_SEGMENT: num 0 1 0 1 0 0 1 1 1 0 ...
## $ CATEGORY : chr "ING ENHANCED WATER" "ING ENHANCED WATER" "ING ENHANCED WATER" "ING ENHANCED WATER" ...
## $ UNIT_SALES : num 2 88 100 25 30 7 155 127 37 5 ...
## $ DOLLAR_SALES : num 8.98 102.28 107.17 44.71 118.26 ...
## $ MANUFACTURER : chr "COCOS" "COCOS" "COCOS" "COCOS" ...
## $ BRAND : chr "STRONGLY ENERGY WATER" "VITAMINAL FLOW" "STRONGLY ENERGY WATER" "VITAMINAL FLOW" ...
## $ PACKAGE : chr ".5L 6ONE JUG" "20SMALL MULTI JUG" "20SMALL MULTI JUG" "26-32SMALL MLT" ...
## $ ITEM : chr "STRONGLY ENERGY WATER IONIC WATER BEVERAGE WOODSY YELLOW JUG 16.9 LIQUID SMALL X6" "TITANS NOURISH WATER WATER BEVERAGE WOODSY YELLOW JUG 20 LIQUID SMALL" "STRONGLY ENERGY WATER IONIC WATER BEVERAGE WOODSY YELLOW JUG 20 LIQUID SMALL" "TITANS NOURISH WATER NUTRIENT ENHANCED WATER BVRG WOODSY YELLOW JUG 32 LIQUID SMALL" ...
## $ REGION : chr "SOCAL" "CALI_NEVADA" "PRAIRIE" "COLORADO" ...
## $ MONTH : Factor w/ 12 levels "1","2","3","4",..: 5 1 4 7 4 8 6 2 8 10 ...
## $ SEASON : chr "SPRING" "WINTER" "SPRING" "SUMMER" ...
print(unique(innovation$ITEM))
## [1] "STRONGLY ENERGY WATER IONIC WATER BEVERAGE WOODSY YELLOW JUG 16.9 LIQUID SMALL X6"
## [2] "TITANS NOURISH WATER WATER BEVERAGE WOODSY YELLOW JUG 20 LIQUID SMALL"
## [3] "STRONGLY ENERGY WATER IONIC WATER BEVERAGE WOODSY YELLOW JUG 20 LIQUID SMALL"
## [4] "TITANS NOURISH WATER NUTRIENT ENHANCED WATER BVRG WOODSY YELLOW JUG 32 LIQUID SMALL"
## [5] "STRONGLY ENERGY WATER IONIC WATER BEVERAGE WOODSY YELLOW JUG 16.9 LIQUID SMALL"
# Count the rows for each unique PACKAGE value in our sample
table(innovation$PACKAGE)
##
## .5L 6ONE JUG .5L MULTI JUG 20SMALL MULTI JUG 26-32SMALL MLT
## 1782 5 4926 2929
#This gives us an idea that there are only 4 package combinations with 7 sub-string elements
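A quick sketch confirming that sub-string count, tokenizing the four PACKAGE values printed above:

```r
# Split each PACKAGE string on spaces and count the distinct tokens
pkgs <- unique(innovation$PACKAGE)
tokens <- unique(unlist(strsplit(pkgs, " ", fixed = TRUE)))
tokens         # ".5L" "6ONE" "JUG" "20SMALL" "MULTI" "26-32SMALL" "MLT"
length(tokens) # 7
```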
# Looking at the competitors within the 'innovation' data frame
model <- lm(UNIT_SALES ~ DOLLAR_SALES + CALORIC_SEGMENT + PACKAGE + SEASON + REGION, data = innovation)
summary(model)
##
## Call:
## lm(formula = UNIT_SALES ~ DOLLAR_SALES + CALORIC_SEGMENT + PACKAGE +
## SEASON + REGION, data = innovation)
##
## Residuals:
## Min 1Q Median 3Q Max
## -581.68 -11.32 -0.20 9.87 299.96
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -19.945521 0.932885 -21.380 < 2e-16 ***
## DOLLAR_SALES 0.582900 0.002442 238.652 < 2e-16 ***
## CALORIC_SEGMENT 9.483530 0.646691 14.665 < 2e-16 ***
## PACKAGE.5L MULTI JUG 15.711797 11.801624 1.331 0.183114
## PACKAGE20SMALL MULTI JUG 30.867445 0.824520 37.437 < 2e-16 ***
## PACKAGE26-32SMALL MLT -5.117949 0.905559 -5.652 1.63e-08 ***
## SEASONSPRING 1.993019 0.767041 2.598 0.009382 **
## SEASONSUMMER 3.395235 0.790524 4.295 1.76e-05 ***
## SEASONWINTER 2.345420 0.758544 3.092 0.001994 **
## REGIONCALI_NEVADA -0.591899 1.751842 -0.338 0.735467
## REGIONCOLORADO -1.508996 0.913527 -1.652 0.098601 .
## REGIONDESERT_SW 2.858842 1.044138 2.738 0.006193 **
## REGIONKANSAS -26.011662 2.915571 -8.922 < 2e-16 ***
## REGIONMOUNTAIN 5.292040 1.064355 4.972 6.74e-07 ***
## REGIONNEWMEXICO -5.049555 1.528053 -3.305 0.000955 ***
## REGIONNOCAL -4.243541 1.713116 -2.477 0.013263 *
## REGIONNORTHERN 8.757881 0.758761 11.542 < 2e-16 ***
## REGIONPRAIRIE 0.156771 1.938273 0.081 0.935538
## REGIONSOCAL -0.894532 1.228555 -0.728 0.466560
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.33 on 9623 degrees of freedom
## Multiple R-squared: 0.9088, Adjusted R-squared: 0.9086
## F-statistic: 5329 on 18 and 9623 DF, p-value: < 2.2e-16
# Looking at the competitors within the 'innovation' data frame
model <- lm(DOLLAR_SALES ~ UNIT_SALES + CALORIC_SEGMENT + PACKAGE + SEASON + REGION, data = innovation)
summary(model)
##
## Call:
## lm(formula = DOLLAR_SALES ~ UNIT_SALES + CALORIC_SEGMENT + PACKAGE +
## SEASON + REGION, data = innovation)
##
## Residuals:
## Min 1Q Median 3Q Max
## -368.48 -17.90 -2.46 12.49 1059.78
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.85905 1.46502 25.842 < 2e-16 ***
## UNIT_SALES 1.46760 0.00615 238.652 < 2e-16 ***
## CALORIC_SEGMENT -4.01836 1.03672 -3.876 0.000107 ***
## PACKAGE.5L MULTI JUG -29.13477 18.72549 -1.556 0.119768
## PACKAGE20SMALL MULTI JUG -37.01711 1.34853 -27.450 < 2e-16 ***
## PACKAGE26-32SMALL MLT 5.61164 1.43813 3.902 9.60e-05 ***
## SEASONSPRING -3.10088 1.21711 -2.548 0.010858 *
## SEASONSUMMER -3.66627 1.25500 -2.921 0.003494 **
## SEASONWINTER -6.84062 1.20219 -5.690 1.31e-08 ***
## REGIONCALI_NEVADA -7.32499 2.77873 -2.636 0.008400 **
## REGIONCOLORADO 0.41421 1.44973 0.286 0.775101
## REGIONDESERT_SW -7.63531 1.65559 -4.612 4.04e-06 ***
## REGIONKANSAS 98.20160 4.53620 21.648 < 2e-16 ***
## REGIONMOUNTAIN -10.17712 1.68784 -6.030 1.70e-09 ***
## REGIONNEWMEXICO -3.08763 2.42579 -1.273 0.203109
## REGIONNOCAL 4.15890 2.71881 1.530 0.126130
## REGIONNORTHERN -13.49698 1.20443 -11.206 < 2e-16 ***
## REGIONPRAIRIE 2.73801 3.07541 0.890 0.373332
## REGIONSOCAL -5.17125 1.94874 -2.654 0.007976 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 41.79 on 9623 degrees of freedom
## Multiple R-squared: 0.8958, Adjusted R-squared: 0.8956
## F-statistic: 4597 on 18 and 9623 DF, p-value: < 2.2e-16
Very interesting findings emerge when unit sales and dollar sales are alternated as the target variable. The opposite signs on REGIONKANSAS in the two models highlight a unique regional effect: unit sales are notably lower in Kansas, yet dollar sales are significantly higher, indicating higher revenue per unit or a preference for pricier products. This discrepancy underscores distinct market dynamics in Kansas, where despite lower quantities the revenue impact is markedly higher, so strategies there should focus on pricing and product mix rather than volume alone. Conversely, the NORTHERN coefficient shows the opposite pattern: it significantly increases UNIT_SALES but significantly decreases DOLLAR_SALES, implying high volume at lower revenue per unit, possibly due to lower pricing or a tilt toward lower-priced items.
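One way to sanity-check that Kansas/Northern reading is to compute implied price per unit by region directly, rather than inferring it from regression coefficients. A sketch, assuming the `innovation` data frame above; if the interpretation holds, Kansas should sit near the top of the list and Northern near the bottom:

```r
# Implied price per unit by region: total dollars divided by total units
innovation %>%
  group_by(REGION) %>%
  summarise(total_units = sum(UNIT_SALES),
            total_dollars = sum(DOLLAR_SALES),
            implied_price = total_dollars / total_units) %>%
  arrange(desc(implied_price))
```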
# Shrinking the df by capping UNIT_SALES and DOLLAR_SALES, then checking the effect on GREETINGLE alone
small_group <- df %>%
filter(UNIT_SALES < 6000, DOLLAR_SALES < 12000)
#skim(small_group)
skim(df %>% filter(BRAND == "GREETINGLE"))
| Name | df %>% filter(BRAND == “G… |
| Number of rows | 48787 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 9 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| MARKET_KEY | 0 | 1 | 1 | 4 | 0 | 200 | 0 |
| DATE | 0 | 1 | 10 | 10 | 0 | 148 | 0 |
| CATEGORY | 0 | 1 | 18 | 18 | 0 | 1 | 0 |
| MANUFACTURER | 0 | 1 | 8 | 8 | 0 | 1 | 0 |
| BRAND | 0 | 1 | 10 | 10 | 0 | 1 | 0 |
| PACKAGE | 0 | 1 | 12 | 17 | 0 | 5 | 0 |
| ITEM | 0 | 1 | 60 | 116 | 0 | 36 | 0 |
| REGION | 0 | 1 | 5 | 11 | 0 | 11 | 0 |
| SEASON | 0 | 1 | 4 | 6 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| CALORIC_SEGMENT | 0 | 1 | 0.00 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | ▁▁▇▁▁ |
| UNIT_SALES | 0 | 1 | 44.49 | 98.77 | 1.0 | 8.00 | 23.00 | 54.00 | 4616.00 | ▇▁▁▁▁ |
| DOLLAR_SALES | 0 | 1 | 117.67 | 284.42 | 0.3 | 26.94 | 64.71 | 131.82 | 11801.56 | ▇▁▁▁▁ |
| MONTH | 0 | 1 | 6.33 | 3.38 | 1.0 | 4.00 | 6.00 | 9.00 | 12.00 | ▇▆▆▅▇ |
# Create a scatter plot with the regression line, colored by MANUFACTURER
ggplot(small_group, aes(x = UNIT_SALES, y = DOLLAR_SALES, color = MANUFACTURER)) +
geom_point(alpha = 0.5) + # Adjust alpha to avoid overplotting, if necessary
geom_smooth(method = "lm", color = "black", se = FALSE) + # Add linear regression line without confidence band for clarity
labs(title = "Linear Model of UNIT_SALES vs. DOLLAR_SALES by MANUFACTURER",
x = "UNIT SALES",
y = "DOLLAR SALES") +
theme_minimal() +
theme(legend.position = "bottom")
## `geom_smooth()` using formula = 'y ~ x'
# Create a new data frame with only the rows where the ITEM column contains the words "woodsy yellow"
wy_small <- df[grep("woodsy yellow", df$ITEM, ignore.case = TRUE), ]
skim(wy_small)
| Name | wy_small |
| Number of rows | 9642 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 9 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| MARKET_KEY | 0 | 1 | 1 | 4 | 0 | 200 | 0 |
| DATE | 0 | 1 | 10 | 10 | 0 | 148 | 0 |
| CATEGORY | 0 | 1 | 18 | 18 | 0 | 1 | 0 |
| MANUFACTURER | 0 | 1 | 5 | 5 | 0 | 1 | 0 |
| BRAND | 0 | 1 | 14 | 21 | 0 | 2 | 0 |
| PACKAGE | 0 | 1 | 12 | 17 | 0 | 4 | 0 |
| ITEM | 0 | 1 | 71 | 85 | 0 | 5 | 0 |
| REGION | 0 | 1 | 5 | 11 | 0 | 11 | 0 |
| SEASON | 0 | 1 | 4 | 6 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| CALORIC_SEGMENT | 0 | 1 | 0.50 | 0.50 | 0.00 | 0.0 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▇ |
| UNIT_SALES | 0 | 1 | 64.27 | 87.13 | 1.00 | 9.0 | 30.00 | 87.00 | 1259.00 | ▇▁▁▁▁ |
| DOLLAR_SALES | 0 | 1 | 105.30 | 129.34 | 0.75 | 23.5 | 59.85 | 139.99 | 1909.83 | ▇▁▁▁▁ |
| MONTH | 0 | 1 | 6.37 | 3.51 | 1.00 | 3.0 | 6.00 | 9.00 | 12.00 | ▇▆▅▅▇ |
Woodsy Yellow averages 64.27 unit sales and $105.30 in dollar sales per row. Compare that to the 10% sample overall (142.32 unit sales, $465.11 dollar sales) and to GREETINGLE (44.50 unit sales, $117.67 dollar sales). These make for interesting price-per-unit ratios.
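Those ratios can be made explicit from the quoted means (implied average price per unit for each group):

```r
# Implied price per unit = mean dollar sales / mean unit sales
c(woodsy_yellow = 105.30 / 64.27,   # ~ $1.64 per unit
  sample_10pct  = 465.11 / 142.32,  # ~ $3.27 per unit
  greetingle    = 117.67 / 44.50)   # ~ $2.64 per unit
```

Woodsy Yellow sells at roughly half the implied price of the sample overall, which is worth keeping in mind when comparing dollar-based rankings.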
# wy_small is the dataframe - CATEGORY dropped since it has only one value; CALORIC_SEGMENT kept since both segments are present
model <- lm(DOLLAR_SALES ~ UNIT_SALES + PACKAGE + CALORIC_SEGMENT + SEASON + REGION, data = wy_small)
summary(model)
##
## Call:
## lm(formula = DOLLAR_SALES ~ UNIT_SALES + PACKAGE + CALORIC_SEGMENT +
## SEASON + REGION, data = wy_small)
##
## Residuals:
## Min 1Q Median 3Q Max
## -368.48 -17.90 -2.46 12.49 1059.78
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.85905 1.46502 25.842 < 2e-16 ***
## UNIT_SALES 1.46760 0.00615 238.652 < 2e-16 ***
## PACKAGE.5L MULTI JUG -29.13477 18.72549 -1.556 0.119768
## PACKAGE20SMALL MULTI JUG -37.01711 1.34853 -27.450 < 2e-16 ***
## PACKAGE26-32SMALL MLT 5.61164 1.43813 3.902 9.60e-05 ***
## CALORIC_SEGMENT -4.01836 1.03672 -3.876 0.000107 ***
## SEASONSPRING -3.10088 1.21711 -2.548 0.010858 *
## SEASONSUMMER -3.66627 1.25500 -2.921 0.003494 **
## SEASONWINTER -6.84062 1.20219 -5.690 1.31e-08 ***
## REGIONCALI_NEVADA -7.32499 2.77873 -2.636 0.008400 **
## REGIONCOLORADO 0.41421 1.44973 0.286 0.775101
## REGIONDESERT_SW -7.63531 1.65559 -4.612 4.04e-06 ***
## REGIONKANSAS 98.20160 4.53620 21.648 < 2e-16 ***
## REGIONMOUNTAIN -10.17712 1.68784 -6.030 1.70e-09 ***
## REGIONNEWMEXICO -3.08763 2.42579 -1.273 0.203109
## REGIONNOCAL 4.15890 2.71881 1.530 0.126130
## REGIONNORTHERN -13.49698 1.20443 -11.206 < 2e-16 ***
## REGIONPRAIRIE 2.73801 3.07541 0.890 0.373332
## REGIONSOCAL -5.17125 1.94874 -2.654 0.007976 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 41.79 on 9623 degrees of freedom
## Multiple R-squared: 0.8958, Adjusted R-squared: 0.8956
## F-statistic: 4597 on 18 and 9623 DF, p-value: < 2.2e-16
# wy_small is the dataframe - CATEGORY dropped since it has only one value; CALORIC_SEGMENT kept since both segments are present
model <- lm(UNIT_SALES ~ DOLLAR_SALES + PACKAGE + CALORIC_SEGMENT + SEASON + REGION, data = wy_small)
summary(model)
##
## Call:
## lm(formula = UNIT_SALES ~ DOLLAR_SALES + PACKAGE + CALORIC_SEGMENT +
## SEASON + REGION, data = wy_small)
##
## Residuals:
## Min 1Q Median 3Q Max
## -581.68 -11.32 -0.20 9.87 299.96
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -19.945521 0.932885 -21.380 < 2e-16 ***
## DOLLAR_SALES 0.582900 0.002442 238.652 < 2e-16 ***
## PACKAGE.5L MULTI JUG 15.711797 11.801624 1.331 0.183114
## PACKAGE20SMALL MULTI JUG 30.867445 0.824520 37.437 < 2e-16 ***
## PACKAGE26-32SMALL MLT -5.117949 0.905559 -5.652 1.63e-08 ***
## CALORIC_SEGMENT 9.483530 0.646691 14.665 < 2e-16 ***
## SEASONSPRING 1.993019 0.767041 2.598 0.009382 **
## SEASONSUMMER 3.395235 0.790524 4.295 1.76e-05 ***
## SEASONWINTER 2.345420 0.758544 3.092 0.001994 **
## REGIONCALI_NEVADA -0.591899 1.751842 -0.338 0.735467
## REGIONCOLORADO -1.508996 0.913527 -1.652 0.098601 .
## REGIONDESERT_SW 2.858842 1.044138 2.738 0.006193 **
## REGIONKANSAS -26.011662 2.915571 -8.922 < 2e-16 ***
## REGIONMOUNTAIN 5.292040 1.064355 4.972 6.74e-07 ***
## REGIONNEWMEXICO -5.049555 1.528053 -3.305 0.000955 ***
## REGIONNOCAL -4.243541 1.713116 -2.477 0.013263 *
## REGIONNORTHERN 8.757881 0.758761 11.542 < 2e-16 ***
## REGIONPRAIRIE 0.156771 1.938273 0.081 0.935538
## REGIONSOCAL -0.894532 1.228555 -0.728 0.466560
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.33 on 9623 degrees of freedom
## Multiple R-squared: 0.9088, Adjusted R-squared: 0.9086
## F-statistic: 5329 on 18 and 9623 DF, p-value: < 2.2e-16
The simple insights of a linear regression on the data make useful points. Let's see if an XGBoost can do better.
# Load and prepare dataset (created separately for brevity): wy_small, one-hot encoded
df <- read.csv("woodsy_one_hot.csv")
df <- df %>%
select(-DATE, -MONTH)
# Summarize the dataset
skimr::skim(df)
| Name | df |
| Number of rows | 9642 |
| Number of columns | 30 |
| _______________________ | |
| Column type frequency: | |
| numeric | 30 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| CALORIC_SEGMENT | 0 | 1 | 0.50 | 0.50 | 0.00 | 0.0 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▇ |
| UNIT_SALES | 0 | 1 | 64.27 | 87.13 | 1.00 | 9.0 | 30.00 | 87.00 | 1259.00 | ▇▁▁▁▁ |
| DOLLAR_SALES | 0 | 1 | 105.30 | 129.34 | 0.75 | 23.5 | 59.85 | 139.99 | 1909.83 | ▇▁▁▁▁ |
| POINT5L | 0 | 1 | 0.19 | 0.39 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
| X20SMALL | 0 | 1 | 0.51 | 0.50 | 0.00 | 0.0 | 1.00 | 1.00 | 1.00 | ▇▁▁▁▇ |
| X26.32SMALL | 0 | 1 | 0.30 | 0.46 | 0.00 | 0.0 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▃ |
| X6ONE | 0 | 1 | 0.18 | 0.39 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
| JUG | 0 | 1 | 0.70 | 0.46 | 0.00 | 0.0 | 1.00 | 1.00 | 1.00 | ▃▁▁▁▇ |
| MLT | 0 | 1 | 0.30 | 0.46 | 0.00 | 0.0 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▃ |
| MULTI | 0 | 1 | 0.51 | 0.50 | 0.00 | 0.0 | 1.00 | 1.00 | 1.00 | ▇▁▁▁▇ |
| BRAND_STRONGLY.ENERGY.WATER | 0 | 1 | 0.40 | 0.49 | 0.00 | 0.0 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▅ |
| BRAND_VITAMINAL.FLOW | 0 | 1 | 0.60 | 0.49 | 0.00 | 0.0 | 1.00 | 1.00 | 1.00 | ▅▁▁▁▇ |
| REGION_ARIZONA | 0 | 1 | 0.29 | 0.45 | 0.00 | 0.0 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▃ |
| REGION_CALI_NEVADA | 0 | 1 | 0.03 | 0.16 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| REGION_COLORADO | 0 | 1 | 0.13 | 0.33 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| REGION_DESERT_SW | 0 | 1 | 0.09 | 0.28 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| REGION_KANSAS | 0 | 1 | 0.01 | 0.10 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| REGION_MOUNTAIN | 0 | 1 | 0.08 | 0.28 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| REGION_NEWMEXICO | 0 | 1 | 0.04 | 0.19 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| REGION_NOCAL | 0 | 1 | 0.03 | 0.16 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| REGION_NORTHERN | 0 | 1 | 0.24 | 0.43 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
| REGION_PRAIRIE | 0 | 1 | 0.02 | 0.14 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| REGION_SOCAL | 0 | 1 | 0.06 | 0.24 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| SEASON_FALL | 0 | 1 | 0.23 | 0.42 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
| SEASON_SPRING | 0 | 1 | 0.26 | 0.44 | 0.00 | 0.0 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▃ |
| SEASON_SUMMER | 0 | 1 | 0.23 | 0.42 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
| SEASON_WINTER | 0 | 1 | 0.28 | 0.45 | 0.00 | 0.0 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▃ |
| ITEM_STRONGLY.IONIC.WATER.BEVERAGE.WOODSY.YELLOW. | 0 | 1 | 0.40 | 0.49 | 0.00 | 0.0 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▅ |
| ITEM_TITANS.NOURISH.WATER.BEVERAGE.WOODSY.YELLOW. | 0 | 1 | 0.30 | 0.46 | 0.00 | 0.0 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▃ |
| ITEM_TITANS.NOURISH.WATER.NUTRIENT.ENHANCED.BVRG.WOODSY.YELLOW. | 0 | 1 | 0.30 | 0.46 | 0.00 | 0.0 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▃ |
# Split the data
set.seed(123)
df_testtrn <- initial_split(df, prop = 0.8, strata = UNIT_SALES)
Train <- training(df_testtrn)
Test <- testing(df_testtrn)
# Prepare features and labels for XGBoost
train_features <- Train[, -which(names(Train) == "UNIT_SALES")]
train_labels <- Train$UNIT_SALES
test_features <- Test[, -which(names(Test) == "UNIT_SALES")]
test_labels <- Test$UNIT_SALES
# Convert data to DMatrix format
dtrain <- xgb.DMatrix(data = as.matrix(train_features), label = train_labels)
dtest <- xgb.DMatrix(data = as.matrix(test_features), label = test_labels)
# Define XGBoost parameters
set.seed(123)
params <- list(
booster = "gbtree",
objective = "reg:squarederror",
eval_metric = "rmse",
eta = 0.05,
max_depth = 4,
min_child_weight = 3,
subsample = 0.7,
colsample_bytree = 0.6,
lambda = 1,
alpha = 1
)
# Perform cross-validation to find the optimal number of boosting rounds
cv_results <- xgb.cv(
params = params,
data = dtrain,
nfold = 5,
nrounds = 500, # maximum number of boosting rounds
early_stopping_rounds = 10,
metrics = "rmse",
seed = 123
)
## [1] train-rmse:104.438734+1.862060 test-rmse:104.277497+5.560981
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
##
## [2] train-rmse:100.481420+1.828578 test-rmse:100.300344+5.635670
## [3] train-rmse:97.437686+1.724683 test-rmse:97.328994+5.723550
## [4] train-rmse:93.509634+1.516900 test-rmse:93.421599+5.714666
## [5] train-rmse:90.528974+1.683243 test-rmse:90.421465+5.492339
## [6] train-rmse:87.285081+1.895664 test-rmse:87.148105+5.487865
## [7] train-rmse:83.916617+1.649719 test-rmse:83.807786+5.331198
## [8] train-rmse:80.411010+1.590570 test-rmse:80.328766+5.610827
## [9] train-rmse:77.374544+2.002057 test-rmse:77.308995+5.784253
## [10] train-rmse:73.891305+1.918699 test-rmse:73.835653+5.606312
## [11] train-rmse:70.844607+1.608307 test-rmse:70.803943+5.597310
## [12] train-rmse:67.984042+1.731115 test-rmse:67.980780+5.192038
## [13] train-rmse:65.292031+1.919502 test-rmse:65.301321+5.581004
## [14] train-rmse:62.984255+2.370586 test-rmse:62.942382+5.732470
## [15] train-rmse:60.817700+2.864825 test-rmse:60.725003+5.909879
## [16] train-rmse:58.408678+2.591975 test-rmse:58.346380+5.762669
## [17] train-rmse:56.160713+2.491385 test-rmse:56.111212+5.433884
## [18] train-rmse:53.793769+2.389017 test-rmse:53.760829+5.327687
## [19] train-rmse:51.870093+1.921023 test-rmse:51.874732+5.251952
## [20] train-rmse:50.175292+1.834348 test-rmse:50.209783+5.044398
## [21] train-rmse:48.508652+1.584712 test-rmse:48.595674+4.824019
## [22] train-rmse:46.713480+1.247490 test-rmse:46.832437+4.790795
## [23] train-rmse:44.834882+1.194239 test-rmse:44.955717+4.731072
## [24] train-rmse:43.469042+1.435777 test-rmse:43.591795+4.865448
## [25] train-rmse:42.178935+1.701390 test-rmse:42.312723+4.991194
## [26] train-rmse:41.142637+1.894216 test-rmse:41.310577+5.191422
## [27] train-rmse:39.932724+1.767986 test-rmse:40.107775+5.006327
## [28] train-rmse:38.460044+1.683877 test-rmse:38.647515+4.945043
## [29] train-rmse:37.404323+1.584751 test-rmse:37.637056+5.158217
## [30] train-rmse:36.231962+1.617828 test-rmse:36.461994+4.950549
## [31] train-rmse:35.410486+1.657499 test-rmse:35.630969+4.642916
## [32] train-rmse:34.653119+1.777648 test-rmse:34.869303+4.383160
## [33] train-rmse:33.592384+1.479384 test-rmse:33.876115+4.334526
## [34] train-rmse:32.754220+1.448990 test-rmse:33.055765+4.231181
## [35] train-rmse:31.948008+1.413150 test-rmse:32.275954+4.064136
## [36] train-rmse:31.362818+1.443906 test-rmse:31.710906+4.100751
## [37] train-rmse:30.840865+1.212678 test-rmse:31.260238+4.201705
## [38] train-rmse:30.129323+1.192861 test-rmse:30.612032+4.454154
## [39] train-rmse:29.458635+1.349180 test-rmse:29.953338+4.216062
## [40] train-rmse:28.711794+1.382303 test-rmse:29.228291+4.378585
## [41] train-rmse:28.088652+1.242652 test-rmse:28.608545+4.187630
## [42] train-rmse:27.620946+1.322708 test-rmse:28.179892+4.012197
## [43] train-rmse:27.192269+1.353644 test-rmse:27.772997+4.060892
## ... (iterations 44-359 omitted; cross-validated test RMSE declines steadily toward its minimum) ...
## [360] train-rmse:12.593523+0.446137 test-rmse:16.098150+2.033837
## [361] train-rmse:12.589956+0.448506 test-rmse:16.100011+2.037986
## Stopping. Best iteration:
## [351] train-rmse:12.642262+0.437834 test-rmse:16.088970+2.042331
best_nrounds <- cv_results$best_iteration
# Train the final model using the best number of rounds found
model_xgb <- xgb.train(
params = params,
data = dtrain,
nrounds = best_nrounds
)
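Retraining at every session is wasteful once the round count is fixed. A minimal sketch, assuming the `xgboost` package is already loaded (the file name is illustrative), of persisting the fitted booster for reuse:

```r
# Save the fitted booster to disk and reload it in a later session
# (sketch; "model_xgb.bin" is a placeholder file name)
xgb.save(model_xgb, "model_xgb.bin")
model_xgb_reloaded <- xgb.load("model_xgb.bin")
```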
# Make predictions and evaluate the model
train_pred <- predict(model_xgb, dtrain)
test_pred <- predict(model_xgb, dtest)
train_rmse <- sqrt(mean((train_labels - train_pred)^2))
test_rmse <- sqrt(mean((test_labels - test_pred)^2))
# Calculate R-squared for the training set
sst_train <- sum((train_labels - mean(train_labels)) ^ 2)
ssr_train <- sum((train_labels - train_pred) ^ 2)
r_squared_train <- 1 - (ssr_train / sst_train)
# Calculate R-squared for the test set
sst_test <- sum((test_labels - mean(test_labels)) ^ 2)
ssr_test <- sum((test_labels - test_pred) ^ 2)
r_squared_test <- 1 - (ssr_test / sst_test)
train_mape <- mean(abs((train_labels - train_pred) / train_labels)) * 100
test_mape <- mean(abs((test_labels - test_pred) / test_labels)) * 100
train_mae <- mean(abs(train_labels - train_pred))
test_mae <- mean(abs(test_labels - test_pred))
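The same four metrics are recomputed verbatim for the no-DOLLAR_SALES model later, so a small helper avoids the duplication. This is a sketch in base R; the `eps` guard is an addition not in the original code, since MAPE is unstable when actual values sit at or near zero:

```r
# Sketch: compute RMSE, R-squared, MAE, and MAPE in one call (base R only).
# The eps guard is an addition: MAPE blows up when actuals are near zero.
regression_metrics <- function(actual, predicted, eps = 1e-8) {
  resid <- actual - predicted
  list(
    rmse      = sqrt(mean(resid^2)),
    r_squared = 1 - sum(resid^2) / sum((actual - mean(actual))^2),
    mae       = mean(abs(resid)),
    mape      = mean(abs(resid / pmax(abs(actual), eps))) * 100
  )
}
# Usage: regression_metrics(test_labels, test_pred)
```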
cat("Model Performance Metrics:\n",
"--------------------------\n",
"Training RMSE: ", train_rmse, "\n",
"Test RMSE: ", test_rmse, "\n",
"Training R-squared: ", r_squared_train, "\n",
"Test R-squared: ", r_squared_test, "\n",
"Training MAE: ", train_mae, "\n",
"Test MAE: ", test_mae, "\n",
"Training MAPE: ", train_mape, "%\n",
"Test MAPE: ", test_mape, "%\n", sep="")
## Model Performance Metrics:
## --------------------------
## Training RMSE: 12.94671
## Test RMSE: 14.37189
## Training R-squared: 0.9780564
## Test R-squared: 0.9720868
## Training MAE: 6.999468
## Test MAE: 7.641593
## Training MAPE: 25.73508%
## Test MAPE: 28.05712%
residuals_train <- train_labels - train_pred
residuals_test <- test_labels - test_pred
residuals_data <- data.frame(
Residuals = c(residuals_train, residuals_test),
Dataset = c(rep('Training', length(residuals_train)), rep('Test', length(residuals_test)))
)
# Plot the residual distributions for the training and test sets
ggplot(residuals_data, aes(x = Residuals, fill = Dataset)) +
geom_histogram(binwidth = 1, position = 'identity', alpha = 0.6) +
facet_wrap(~ Dataset) +
ggtitle('Residuals Distribution')
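The histograms show the overall shape of the errors; a residuals-vs-fitted scatter (a sketch reusing the objects defined above) additionally checks whether error variance grows with predicted volume:

```r
# Sketch: residuals vs. fitted values for the test set, to check heteroscedasticity
ggplot(data.frame(Fitted = test_pred, Residuals = residuals_test),
       aes(x = Fitted, y = Residuals)) +
  geom_point(alpha = 0.4) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  ggtitle("Residuals vs. Fitted Values (Test Set)")
```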
# Prepare actual vs. predicted data for plotting
actual_pred_data <- data.frame(
Actual = c(train_labels, test_labels),
Predicted = c(train_pred, test_pred),
Dataset = c(rep('Training', length(train_labels)), rep('Test', length(test_labels)))
)
# Plotting Actual vs. Predicted Values
ggplot(actual_pred_data, aes(x = Actual, y = Predicted, colour = Dataset)) +
geom_point(alpha = 0.6) +
geom_abline(intercept = 0, slope = 1, linetype = 'dashed', color = 'red') +
xlab('Actual Values') +
ylab('Predicted Values') +
scale_colour_manual(values = c('Training' = 'blue', 'Test' = 'red')) +
ggtitle('Actual vs. Predicted Values')
# Calculate feature importance
importance_matrix <- xgb.importance(feature_names = colnames(train_features), model = model_xgb)
# View the feature importance scores
print(importance_matrix)
## Feature                                                Gain          Cover        Frequency
##  1: DOLLAR_SALES                                       7.160014e-01  0.416008824  0.277308955
##  2: ITEM_TITANS.NOURISH.WATER.BEVERAGE.WOODSY.YELLOW.  8.460387e-02  0.072248291  0.040318800
##  3: X20SMALL                                           5.253331e-02  0.039518591  0.065869667
##  4: REGION_KANSAS                                      2.588577e-02  0.063563389  0.032817628
##  5: X26.32SMALL                                        2.140248e-02  0.007447924  0.015471167
##  6: CALORIC_SEGMENT                                    1.875034e-02  0.026356971  0.053211439
##  7: REGION_NORTHERN                                    1.839264e-02  0.026885309  0.034224098
##  8: POINT5L                                            1.135987e-02  0.022911266  0.052039381
##  9: ITEM_STRONGLY.IONIC.WATER.BEVERAGE.WOODSY.YELLOW.  1.099146e-02  0.003510788  0.003750586
## 10: MULTI                                              9.433681e-03  0.021345434  0.027660572
## 11: X6ONE                                              5.132903e-03  0.010413045  0.017112049
## 12: BRAND_STRONGLY.ENERGY.WATER                        4.417200e-03  0.029111763  0.033755274
## 13: SEASON_SUMMER                                      3.310559e-03  0.018095639  0.047819972
## 14: REGION_NEWMEXICO                                   3.074743e-03  0.034287331  0.019924988
## 15: REGION_COLORADO                                    2.858571e-03  0.013095214  0.024613221
## 16: SEASON_WINTER                                      2.382166e-03  0.022375256  0.041725270
## 17: REGION_ARIZONA                                     1.989458e-03  0.030481949  0.033286451
## 18: BRAND_VITAMINAL.FLOW                               1.603145e-03  0.016418292  0.014767932
## 19: REGION_MOUNTAIN                                    1.128815e-03  0.018426347  0.020862635
## 20: REGION_SOCAL                                       9.943884e-04  0.022579104  0.015705579
## 21: REGION_PRAIRIE                                     8.475404e-04  0.026447717  0.018049695
## 22: JUG                                                7.345926e-04  0.002186636  0.003281763
## 23: REGION_DESERT_SW                                   7.087854e-04  0.009764727  0.015939991
## 24: SEASON_FALL                                        7.024262e-04  0.009622391  0.032583216
## 25: SEASON_SPRING                                      4.200505e-04  0.008589790  0.040084388
## 26: REGION_NOCAL                                       1.803521e-04  0.017919968  0.009376465
## 27: REGION_CALI_NEVADA                                 1.078172e-04  0.009019048  0.005860291
## 28: MLT                                                5.166534e-05  0.001368996  0.002578528
# Plot the feature importance
xgb.plot.importance(importance_matrix = importance_matrix)
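With 28 features the default importance plot is crowded. A sketch using `xgb.ggplot.importance()` (also provided by the xgboost package) to restrict the plot to the largest contributors:

```r
# Sketch: ggplot2-based importance plot limited to the ten highest-gain features
xgb.ggplot.importance(importance_matrix, top_n = 10)
```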
# Compute partial dependence data for 'DOLLAR_SALES' and 'CALORIC_SEGMENT'
pd <- partial(model_xgb, pred.var = c("DOLLAR_SALES", "CALORIC_SEGMENT"), train = train_features, grid.resolution = 20)
# Default PDP
pdp1 <- plotPartial(pd)
# Add contour lines and use a different color palette
rwb <- colorRampPalette(c("red", "white", "blue"))
pdp2 <- plotPartial(pd, contour = TRUE, col.regions = rwb)
# 3-D surface
pdp3 <- plotPartial(pd, levelplot = FALSE, zlab = "Predicted Outcome", drape = TRUE, colorkey = TRUE, screen = list(z = -20, x = -60))
# Combine plots into one window
grid.arrange(pdp1, pdp2, pdp3, ncol = 3)
Model Without the DOLLAR_SALES Variable
# Assuming 'df' is your complete dataframe and 'UNIT_SALES' is your target variable
df2 <- df
# Remove DOLLAR_SALES from the features
df2$DOLLAR_SALES <- NULL
# Split the updated data into training and testing sets, using the same approach as before
set.seed(123)
df2_testtrn <- initial_split(df2, prop = 0.8, strata = UNIT_SALES)
Train <- training(df2_testtrn)
Test <- testing(df2_testtrn)
# Prepare features and labels for XGBoost, excluding DOLLAR_SALES
train_features <- Train[, -which(names(Train) == "UNIT_SALES")]
train_labels <- Train$UNIT_SALES
test_features <- Test[, -which(names(Test) == "UNIT_SALES")]
test_labels <- Test$UNIT_SALES
# Convert data to DMatrix format for XGBoost
dtrain <- xgb.DMatrix(data = as.matrix(train_features), label = train_labels)
dtest <- xgb.DMatrix(data = as.matrix(test_features), label = test_labels)
# Assuming 'params' and 'best_nrounds' are defined as before
# Train the final model without DOLLAR_SALES
model_xgb_no_dollar_sales <- xgb.train(
params = params,
data = dtrain,
nrounds = best_nrounds
)
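One caveat: `params` and `best_nrounds` were tuned with DOLLAR_SALES in the feature set, so the optimal round count may differ for the reduced features. A hedged sketch of re-selecting `nrounds` via cross-validation (assuming the same `params` object and the `dtrain` built above without DOLLAR_SALES):

```r
# Sketch: re-select the number of boosting rounds for the reduced feature set
cv_no_ds <- xgb.cv(
  params = params,
  data = dtrain,               # this dtrain excludes DOLLAR_SALES
  nrounds = 1000,
  nfold = 5,
  early_stopping_rounds = 10,
  verbose = 0
)
best_nrounds_no_ds <- cv_no_ds$best_iteration
```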
# Make predictions and evaluate the model
train_pred <- predict(model_xgb_no_dollar_sales, dtrain)
test_pred <- predict(model_xgb_no_dollar_sales, dtest)
train_rmse <- sqrt(mean((train_labels - train_pred)^2))
test_rmse <- sqrt(mean((test_labels - test_pred)^2))
# Calculate R-squared for the training set
sst_train <- sum((train_labels - mean(train_labels)) ^ 2)
ssr_train <- sum((train_labels - train_pred) ^ 2)
r_squared_train <- 1 - (ssr_train / sst_train)
# Calculate R-squared for the test set
sst_test <- sum((test_labels - mean(test_labels)) ^ 2)
ssr_test <- sum((test_labels - test_pred) ^ 2)
r_squared_test <- 1 - (ssr_test / sst_test)
train_mape <- mean(abs((train_labels - train_pred) / train_labels)) * 100
test_mape <- mean(abs((test_labels - test_pred) / test_labels)) * 100
train_mae <- mean(abs(train_labels - train_pred))
test_mae <- mean(abs(test_labels - test_pred))
cat("Model Performance Metrics:\n",
"--------------------------\n",
"Training RMSE: ", train_rmse, "\n",
"Test RMSE: ", test_rmse, "\n",
"Training R-squared: ", r_squared_train, "\n",
"Test R-squared: ", r_squared_test, "\n",
"Training MAE: ", train_mae, "\n",
"Test MAE: ", test_mae, "\n",
"Training MAPE: ", train_mape, "%\n",
"Test MAPE: ", test_mape, "%\n", sep="")
## Model Performance Metrics:
## --------------------------
## Training RMSE: 57.75024
## Test RMSE: 60.45803
## Training R-squared: 0.5633861
## Test R-squared: 0.5060416
## Training MAE: 32.79925
## Test MAE: 34.51334
## Training MAPE: 196.601%
## Test MAPE: 206.9347%
Performance drops sharply once DOLLAR_SALES is removed. This is expected, since dollar sales are nearly collinear with the unit-sales target (revenue is roughly units times price). Even so, the remaining features together still carry meaningful predictive signal.
# Calculate feature importance
importance_matrix2 <- xgb.importance(feature_names = colnames(train_features), model = model_xgb_no_dollar_sales)
# View the feature importance scores
print(importance_matrix2)
## Feature                                                            Gain          Cover         Frequency
##  1: ITEM_TITANS.NOURISH.WATER.BEVERAGE.WOODSY.YELLOW.              2.662779e-01  0.0764586652  0.0533990993
##  2: X20SMALL                                                       1.769269e-01  0.0516776858  0.0675530774
##  3: REGION_KANSAS                                                  1.595826e-01  0.1127620588  0.0467510187
##  4: CALORIC_SEGMENT                                                9.201953e-02  0.0578886801  0.0812781471
##  5: REGION_NORTHERN                                                5.986639e-02  0.0426304578  0.0360283080
##  6: BRAND_STRONGLY.ENERGY.WATER                                    3.487443e-02  0.0254071213  0.0334548574
##  7: X26.32SMALL                                                    3.298050e-02  0.0169064936  0.0250911430
##  8: POINT5L                                                        3.059437e-02  0.0206221847  0.0553291872
##  9: SEASON_SUMMER                                                  2.233120e-02  0.0515547338  0.0782757881
## 10: SEASON_WINTER                                                  1.931196e-02  0.0491548540  0.0739867038
## 11: REGION_NEWMEXICO                                               1.851870e-02  0.0349538334  0.0223032383
## 12: X6ONE                                                          1.056117e-02  0.0090485515  0.0197297877
## 13: SEASON_FALL                                                    9.596539e-03  0.0357442202  0.0613339052
## 14: SEASON_SPRING                                                  9.429947e-03  0.0354170541  0.0662663521
## 15: BRAND_VITAMINAL.FLOW                                           8.704337e-03  0.0083981897  0.0092215312
## 16: REGION_ARIZONA                                                 7.556334e-03  0.0461578166  0.0362427622
## 17: MULTI                                                          7.540057e-03  0.0256035268  0.0242333262
## 18: REGION_PRAIRIE                                                 6.702594e-03  0.0475177000  0.0293802273
## 19: REGION_SOCAL                                                   6.415693e-03  0.0355916221  0.0272356852
## 20: REGION_MOUNTAIN                                                4.606337e-03  0.0485284474  0.0340982200
## 21: REGION_COLORADO                                                4.017641e-03  0.0366521326  0.0257345057
## 22: REGION_DESERT_SW                                               3.187877e-03  0.0452904029  0.0283079563
## 23: REGION_CALI_NEVADA                                             2.724455e-03  0.0349067173  0.0184430624
## 24: ITEM_STRONGLY.IONIC.WATER.BEVERAGE.WOODSY.YELLOW.              2.190476e-03  0.0077097116  0.0087926228
## 25: REGION_NOCAL                                                   1.587601e-03  0.0348948059  0.0233755093
## 26: JUG                                                            1.013986e-03  0.0051364549  0.0092215312
## 27: MLT                                                            8.680470e-04  0.0031889431  0.0047179927
## 28: ITEM_TITANS.NOURISH.WATER.NUTRIENT.ENHANCED.BVRG.WOODSY.YELLOW. 1.251844e-05 0.0001969349  0.0002144542
xgb.plot.importance(importance_matrix = importance_matrix2)
if (!requireNamespace("pdp", quietly = TRUE)) install.packages("pdp")
if (!requireNamespace("xgboost", quietly = TRUE)) install.packages("xgboost")
library(pdp)
library(xgboost)
pdp::partial(model_xgb_no_dollar_sales, pred.var = "REGION_KANSAS", train = train_features)
## REGION_KANSAS yhat
## 1 0 62.04574
## 2 1 224.83023
# We know this has a negative effect on UNIT_SALES but a positive effect on DOLLAR_SALES
pdp::partial(model_xgb_no_dollar_sales, pred.var = "REGION_NORTHERN", train = train_features)
## REGION_NORTHERN yhat
## 1 0 59.34926
## 2 1 77.09184
pd <- partial(model_xgb_no_dollar_sales, pred.var = c("REGION_KANSAS", "REGION_NORTHERN"), train = train_features, grid.resolution = 20)
# Default PDP
pdp1 <- plotPartial(pd)
# Add contour lines and use a different color palette
rwb <- colorRampPalette(c("red", "white", "blue"))
pdp2 <- plotPartial(pd, contour = TRUE, col.regions = rwb)
# 3-D surface
pdp3 <- plotPartial(pd, levelplot = FALSE, zlab = "Predicted Outcome", drape = TRUE, colorkey = TRUE, screen = list(z = -20, x = -60))
# Combine plots into one window
grid.arrange(pdp1, pdp2, pdp3, ncol = 3)
Item Description: Diet Energy Moonlit Casava 2L Multi Jug
Caloric Segment: Diet
Market Category: Energy
Manufacturer: Swire-CC
Brand: Diet Moonlit
Package Type: 2L Multi Jug
Flavor: ‘Cassava’

Swire plans to release this product for 6 months. What will the forecasted demand be, in weeks, for this product?
# summary(df)  # same output as shown earlier
# create a table of total values by brand
brand_summary <- df %>%
group_by(BRAND) %>%
summarise(
total_units_sold = sum(UNIT_SALES),
total_revenue = sum(DOLLAR_SALES),
avg_price = total_revenue / total_units_sold,
total_days_sold = n() # Count the number of rows for each brand
) %>%
arrange(desc(total_revenue)) %>% # Order by revenue in descending order
mutate(rank = row_number())
summary(brand_summary)
## BRAND total_units_sold total_revenue avg_price
## Length:288 Min. : 1 Min. : 1 Min. : 0.5315
## Class :character 1st Qu.: 2310 1st Qu.: 7563 1st Qu.: 2.0861
## Mode :character Median : 94691 Median : 266075 Median : 3.0291
## Mean : 1473003 Mean : 4989427 Mean : 3.2661
## 3rd Qu.: 651385 3rd Qu.: 2161764 3rd Qu.: 3.7252
## Max. :40414038 Max. :159387186 Max. :42.9378
## total_days_sold rank
## Min. : 1.0 Min. : 1.00
## 1st Qu.: 121.8 1st Qu.: 72.75
## Median : 1988.0 Median :144.50
## Mean : 8493.5 Mean :144.50
## 3rd Qu.: 8075.8 3rd Qu.:216.25
## Max. :124603.0 Max. :288.00
print(brand_summary[brand_summary$BRAND == "DIET MOONLIT", ])
## # A tibble: 1 × 6
## BRAND total_units_sold total_revenue avg_price total_days_sold rank
## <chr> <dbl> <dbl> <dbl> <int> <int>
## 1 DIET MOONLIT 749702. 2623607. 3.50 7617 69
Diet Moonlit is a rising star, ranking 69th of 288 brands by total revenue, with an average price of $3.50, slightly above the overall mean of $3.27.
# Filter the dataframe for only 'DIET MOONLIT'
filtered_df <- df %>%
filter(BRAND == "DIET MOONLIT")
# Create the plot
ggplot(filtered_df, aes(x = UNIT_SALES, y = DOLLAR_SALES)) +
geom_point(color = "red", alpha = 1) + # Bright red points with full opacity
geom_smooth(method = "lm", color = "black", se = FALSE) + # Add linear regression line without confidence band
labs(title = "Linear Model of UNIT_SALES vs. DOLLAR_SALES for DIET MOONLIT",
x = "UNIT SALES",
y = "DOLLAR SALES") +
theme_minimal() +
theme(legend.position = "none")
## `geom_smooth()` using formula = 'y ~ x'
DIET MOONLIT has a tight cluster below 1,000 unit sales and $2,500 in revenue, but a few remarkable high fliers reach nearly $20,000 and just over 3,000 units.
filtered_df %>%
mutate(DATE = as.Date(DATE)) %>%
mutate(WEEK = as.integer(format(DATE, "%U"))) %>%
group_by(WEEK) %>%
summarise(total_sales = sum(UNIT_SALES)) %>%
ggplot(aes(x = WEEK, y = total_sales)) +
geom_line(color = "black") + # Black line connecting weekly totals
labs(title = "Total Sales by Week of the Year",
x = "Week of the Year",
y = "Total Unit Sales") +
theme_minimal()
> DIET MOONLIT shows many peaks and valleys in sales by week.
library(zoo)
# Calculate total sales for each rolling window of 21 consecutive weeks (a proxy for the 6-month release)
sales_by_group <- filtered_df %>%
mutate(DATE = as.Date(DATE)) %>%
mutate(WEEK = as.integer(format(DATE, "%U"))) %>%
group_by(WEEK) %>%
summarise(total_sales = sum(UNIT_SALES)) %>%
mutate(sales_in_group = rollsum(total_sales, 21, align = "left", fill = NA)) %>%
mutate(week_label = paste0("Week ", WEEK + 1, " to Week ", WEEK + 21)) %>%
arrange(WEEK) %>% # Order by WEEK
filter(!is.na(sales_in_group)) # Remove rows with sales_in_group = NA
# Plot the bar chart
sales_by_group$week_label <- factor(sales_by_group$week_label, levels = sales_by_group$week_label[order(sales_by_group$WEEK)])
ggplot(sales_by_group, aes(x = factor(week_label), y = sales_in_group)) +
geom_bar(stat = "identity", fill = "black") +
labs(title = "Total Sales for Each 6-month Grouping",
x = "Weeks (Starting from Week 1)",
y = "Total Sales") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
> DIET MOONLIT historically has its best 21-week run from week 7 to week 27.
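Rather than eyeballing the tallest bar, the peak window can be pulled out directly. A minimal sketch, assuming `sales_by_group` from the chunk above is still in scope (`dplyr::slice_max` keeps the row with the largest `sales_in_group`):

```r
# Identify the strongest 21-week window programmatically
best_window <- sales_by_group %>%
  slice_max(sales_in_group, n = 1)
print(best_window$week_label)     # label of the window with the highest rolling total
print(best_window$sales_in_group) # its total unit sales
```

The same pattern applies to the casava and innovation windows below.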
# Find the best 21-week window for Casava sales
# Calculate total sales for each rolling window of 21 consecutive weeks
sales_by_casava <- df %>%
filter(str_detect(ITEM, "CASAVA")) %>%
mutate(DATE = as.Date(DATE)) %>%
mutate(WEEK = as.integer(format(DATE, "%U"))) %>%
group_by(WEEK) %>%
summarise(total_sales = sum(UNIT_SALES)) %>%
mutate(sales_in_group = rollsum(total_sales, 21, align = "left", fill = NA)) %>%
mutate(week_label = paste0("Week ", WEEK + 1, " to Week ", WEEK + 21)) %>%
arrange(WEEK) %>% # Order by WEEK
filter(!is.na(sales_in_group)) # Remove rows with sales_in_group = NA
# Plot the bar chart
sales_by_casava$week_label <- factor(sales_by_casava$week_label, levels = sales_by_casava$week_label[order(sales_by_casava$WEEK)])
ggplot(sales_by_casava, aes(x = factor(week_label), y = sales_in_group)) +
geom_bar(stat = "identity", fill = "black") +
labs(title = "Total Sales for Each 21-Week Grouping",
x = "Weeks (Starting from Week 1)",
y = "Total Sales") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
> Casava sales are best in the 21 weeks from week 14 to 34.
# Find the best 13-week window for casava energy drinks
# Calculate total sales for each rolling window of 13 consecutive weeks
sales_by_innovation <- df %>%
filter(CATEGORY == "ENERGY",
str_detect(ITEM, "CASAVA")) %>%
mutate(DATE = as.Date(DATE)) %>%
mutate(WEEK = as.integer(format(DATE, "%U"))) %>%
group_by(WEEK) %>%
summarise(total_sales = sum(UNIT_SALES)) %>%
mutate(sales_in_group = rollsum(total_sales, 13, align = "left", fill = NA)) %>%
mutate(week_label = paste0("Week ", WEEK + 1, " to Week ", WEEK + 13)) %>%
arrange(WEEK) %>% # Order by WEEK
filter(!is.na(sales_in_group)) # Remove rows with sales_in_group = NA
# Plot the bar chart
sales_by_innovation$week_label <- factor(sales_by_innovation$week_label, levels = sales_by_innovation$week_label[order(sales_by_innovation$WEEK)])
ggplot(sales_by_innovation, aes(x = factor(week_label), y = sales_in_group)) +
geom_bar(stat = "identity", fill = "black") +
labs(title = "Total Sales for Each 13-Week Grouping",
x = "Weeks (Starting from Week 1)",
y = "Total Sales") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
# Create a smaller "innovation" data frame: Energy category, Casava flavor
innovation <- df %>%
filter(CATEGORY == "ENERGY",
str_detect(ITEM, "CASAVA"))
#unique PACKAGE string from innovation
print(unique(innovation$PACKAGE))
## [1] "16SMALL MULTI CUP" "16SMALL 24ONE CUP" "12SMALL MULTI CUP"
library(dplyr)
library(lubridate)
innovation <- innovation %>%
mutate(
MONTH = month(ymd(DATE)), # Extract month using lubridate's ymd function
MONTH = as.factor(MONTH) # Convert the extracted month into a factor
)
str(innovation)
## 'data.frame': 5069 obs. of 13 variables:
## $ MARKET_KEY : chr "685" "1811" "6" "212" ...
## $ DATE : chr "2022-04-16" "2021-10-02" "2023-10-07" "2023-07-15" ...
## $ CALORIC_SEGMENT: num 1 1 1 1 1 1 1 1 1 1 ...
## $ CATEGORY : chr "ENERGY" "ENERGY" "ENERGY" "ENERGY" ...
## $ UNIT_SALES : num 25 18 18 127 2 3 50 9 28 12 ...
## $ DOLLAR_SALES : num 52.58 30.02 29.5 314.43 3.34 ...
## $ MANUFACTURER : chr "JOLLYS" "JOLLYS" "JOLLYS" "PONYS" ...
## $ BRAND : chr "SUPER-DUPER JUICED" "SUPER-DUPER JUICED" "SUPER-DUPER RECOVERY" "MYTHICAL BEVERAGE" ...
## $ PACKAGE : chr "16SMALL MULTI CUP" "16SMALL MULTI CUP" "16SMALL MULTI CUP" "16SMALL MULTI CUP" ...
## $ ITEM : chr "SUPER-DUPER JUICED ENERGY DRINK CASAVA SUNSET GUAVA CUP 16 LIQUID SMALL" "SUPER-DUPER JUICED ENERGY DRINK CASAVA SUNSET GUAVA CUP 16 LIQUID SMALL" "SUPER-DUPER RECOVERY ENERGY DRINK CASAVA JACK CUP 16 LIQUID SMALL" "MYTHICAL BEVERAGE RESERVE ENERGY DRINK WHITE CASAVA CUP 16 LIQUID SMALL" ...
## $ REGION : chr "COLORADO" "KANSAS" "NORTHERN" "DESERT_SW" ...
## $ MONTH : Factor w/ 12 levels "1","2","3","4",..: 4 10 10 7 8 3 11 4 4 9 ...
## $ SEASON : chr "SPRING" "FALL" "FALL" "SUMMER" ...
print(unique(innovation$ITEM))
## [1] "SUPER-DUPER JUICED ENERGY DRINK CASAVA SUNSET GUAVA CUP 16 LIQUID SMALL"
## [2] "SUPER-DUPER RECOVERY ENERGY DRINK CASAVA JACK CUP 16 LIQUID SMALL"
## [3] "MYTHICAL BEVERAGE RESERVE ENERGY DRINK WHITE CASAVA CUP 16 LIQUID SMALL"
## [4] "SUPER-DUPER PITAYA ED ENERGY DRINK CASAVA NO ARTIFICIAL SWEETENERS CUP 16 LIQUID SMALL"
## [5] "MYTHICAL BEVERAGE RESERVE ENERGY DRINK WHITE CASAVA CUP 16 LIQUID SMALL X24"
## [6] "JUMPIN-FISH ENERGY DRINK CASAVA JACK CUP 12 LIQUID SMALL"
# Tabulate the PACKAGE values in our sample
table(innovation$PACKAGE)
##
## 12SMALL MULTI CUP 16SMALL 24ONE CUP 16SMALL MULTI CUP
## 3 3 5063
# Fit a linear model of DOLLAR_SALES on the innovation data frame
model <- lm(DOLLAR_SALES ~ UNIT_SALES + CALORIC_SEGMENT + PACKAGE + SEASON + REGION, data = innovation)
summary(model)
##
## Call:
## lm(formula = DOLLAR_SALES ~ UNIT_SALES + CALORIC_SEGMENT + PACKAGE +
## SEASON + REGION, data = innovation)
##
## Residuals:
## Min 1Q Median 3Q Max
## -675.01 -11.00 3.49 19.22 881.56
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.963688 29.351687 0.169 0.865716
## UNIT_SALES 2.335631 0.004517 517.110 < 2e-16 ***
## CALORIC_SEGMENT NA NA NA NA
## PACKAGE16SMALL 24ONE CUP 9.015525 41.412865 0.218 0.827673
## PACKAGE16SMALL MULTI CUP -10.472066 29.301506 -0.357 0.720815
## SEASONSPRING -0.940580 1.986640 -0.473 0.635911
## SEASONSUMMER -9.386618 1.981350 -4.737 2.22e-06 ***
## SEASONWINTER 0.356214 1.995442 0.179 0.858327
## REGIONCALI_NEVADA 3.969486 4.369068 0.909 0.363635
## REGIONCOLORADO 4.995027 2.599420 1.922 0.054713 .
## REGIONDESERT_SW -1.735148 3.109124 -0.558 0.576813
## REGIONKANSAS -31.939006 6.342170 -5.036 4.92e-07 ***
## REGIONMOUNTAIN -22.384831 2.688393 -8.326 < 2e-16 ***
## REGIONNEWMEXICO -2.944923 3.963758 -0.743 0.457539
## REGIONNOCAL 11.129613 3.993008 2.787 0.005335 **
## REGIONNORTHERN -11.480915 2.071048 -5.544 3.11e-08 ***
## REGIONPRAIRIE -3.761095 4.735497 -0.794 0.427096
## REGIONSOCAL 11.447384 2.943193 3.889 0.000102 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.69 on 5052 degrees of freedom
## Multiple R-squared: 0.9836, Adjusted R-squared: 0.9836
## F-statistic: 1.895e+04 on 16 and 5052 DF, p-value: < 2.2e-16
Cassava and Energy together do quite well, with an R² of 0.98. (CALORIC_SEGMENT drops out of the fit as singular, so we cannot also model DIET here, but we expect that folks who like regular Cassava Energy will also like the Diet version.) Summer is statistically significant but negatively associated with sales; SOCAL and NOCAL are significant in the positive direction.
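The "1 not defined because of singularities" note can be confirmed directly. A quick sketch, assuming `model` and `innovation` from the chunk above (`alias()` is base R's tool for finding linearly dependent terms):

```r
# Show which coefficient is aliased (linearly dependent on the others)
alias(model)

# A constant predictor is the usual culprit: check how many distinct
# values CALORIC_SEGMENT takes within the innovation subset
length(unique(innovation$CALORIC_SEGMENT))
```

If CALORIC_SEGMENT is constant in this subset, it carries no information and `lm()` rightly drops it.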
library(dplyr)
small_group <- df %>%
filter(UNIT_SALES < 7000, DOLLAR_SALES < 20000)
skim(small_group)
| Name | small_group |
| Number of rows | 2440830 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 9 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| MARKET_KEY | 0 | 1 | 1 | 4 | 0 | 200 | 0 |
| DATE | 0 | 1 | 10 | 10 | 0 | 152 | 0 |
| CATEGORY | 0 | 1 | 3 | 18 | 0 | 5 | 0 |
| MANUFACTURER | 0 | 1 | 5 | 8 | 0 | 8 | 0 |
| BRAND | 0 | 1 | 4 | 56 | 0 | 288 | 0 |
| PACKAGE | 0 | 1 | 11 | 26 | 0 | 95 | 0 |
| ITEM | 0 | 1 | 26 | 142 | 0 | 2999 | 0 |
| REGION | 0 | 1 | 5 | 11 | 0 | 11 | 0 |
| SEASON | 0 | 1 | 4 | 6 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| CALORIC_SEGMENT | 0 | 1 | 0.50 | 0.50 | 0.00 | 0.00 | 1.00 | 1.0 | 1.00 | ▇▁▁▁▇ |
| UNIT_SALES | 0 | 1 | 149.70 | 381.66 | 0.04 | 11.00 | 40.00 | 125.0 | 6999.00 | ▇▁▁▁▁ |
| DOLLAR_SALES | 0 | 1 | 496.70 | 1224.33 | 0.01 | 36.31 | 134.48 | 423.7 | 19995.05 | ▇▁▁▁▁ |
| MONTH | 0 | 1 | 6.28 | 3.43 | 1.00 | 3.00 | 6.00 | 9.0 | 12.00 | ▇▆▆▅▇ |
skim(df %>% filter(BRAND == "DIET MOONLIT"))
| Name | df %>% filter(BRAND == “D… |
| Number of rows | 7617 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 9 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| MARKET_KEY | 0 | 1 | 1 | 4 | 0 | 200 | 0 |
| DATE | 0 | 1 | 10 | 10 | 0 | 147 | 0 |
| CATEGORY | 0 | 1 | 3 | 3 | 0 | 1 | 0 |
| MANUFACTURER | 0 | 1 | 8 | 8 | 0 | 1 | 0 |
| BRAND | 0 | 1 | 12 | 12 | 0 | 1 | 0 |
| PACKAGE | 0 | 1 | 12 | 17 | 0 | 5 | 0 |
| ITEM | 0 | 1 | 50 | 63 | 0 | 5 | 0 |
| REGION | 0 | 1 | 5 | 11 | 0 | 11 | 0 |
| SEASON | 0 | 1 | 4 | 6 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| CALORIC_SEGMENT | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | ▁▁▇▁▁ |
| UNIT_SALES | 0 | 1 | 98.42 | 292.11 | 1.00 | 23.0 | 51.00 | 92.00 | 6216.00 | ▇▁▁▁▁ |
| DOLLAR_SALES | 0 | 1 | 344.44 | 1041.41 | 0.47 | 56.3 | 133.66 | 331.84 | 18382.96 | ▇▁▁▁▁ |
| MONTH | 0 | 1 | 6.30 | 3.39 | 1.00 | 3.0 | 6.00 | 9.00 | 12.00 | ▇▆▆▅▇ |
Our small_group data frame has higher mean unit sales and dollar sales (149 units, $496) than the DIET MOONLIT subset (98 units, $344).
# Create a scatter plot with the regression line, colored by MANUFACTURER
ggplot(small_group, aes(x = UNIT_SALES, y = DOLLAR_SALES, color = MANUFACTURER)) +
geom_point(alpha = 0.5) + # Adjust alpha to avoid overplotting, if necessary
geom_smooth(method = "lm", color = "black", se = FALSE) + # Add linear regression line without confidence band for clarity
labs(title = "Linear Model of UNIT_SALES vs. DOLLAR_SALES by MANUFACTURER",
x = "UNIT SALES",
y = "DOLLAR SALES") +
theme_minimal() +
theme(legend.position = "bottom") # Adjust legend position if needed
## `geom_smooth()` using formula = 'y ~ x'
Behold the realm of DIET MOONLIT. Certain items sell much better, or worse, judging by the slope of dollars against units sold. While most of its realm sits in the lower-left portion, other brands stretch along both the unit and dollar sales axes.
# Investigating drinks with casava as a flavor in the Item description.
# Create a new data frame with only the rows where the ITEM column contains the word 'casava'
casava_small <- df[grep("casava", df$ITEM, ignore.case = TRUE), ]
skim(casava_small)
| Name | casava_small |
| Number of rows | 42298 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 9 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| MARKET_KEY | 0 | 1 | 1 | 4 | 0 | 200 | 0 |
| DATE | 0 | 1 | 10 | 10 | 0 | 152 | 0 |
| CATEGORY | 0 | 1 | 3 | 18 | 0 | 4 | 0 |
| MANUFACTURER | 0 | 1 | 5 | 8 | 0 | 5 | 0 |
| BRAND | 0 | 1 | 5 | 26 | 0 | 27 | 0 |
| PACKAGE | 0 | 1 | 12 | 26 | 0 | 25 | 0 |
| ITEM | 0 | 1 | 46 | 112 | 0 | 83 | 0 |
| REGION | 0 | 1 | 5 | 11 | 0 | 11 | 0 |
| SEASON | 0 | 1 | 4 | 6 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| CALORIC_SEGMENT | 0 | 1 | 0.60 | 0.49 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | ▅▁▁▁▇ |
| UNIT_SALES | 0 | 1 | 71.19 | 167.47 | 1.00 | 9.00 | 29.00 | 78.00 | 6678.00 | ▇▁▁▁▁ |
| DOLLAR_SALES | 0 | 1 | 184.08 | 376.57 | 0.06 | 23.02 | 75.76 | 210.82 | 12569.79 | ▇▁▁▁▁ |
| MONTH | 0 | 1 | 6.48 | 3.34 | 1.00 | 4.00 | 7.00 | 9.00 | 12.00 | ▇▆▆▅▇ |
Casava items have much lower mean unit sales and dollar sales (71 units, $184) than DIET MOONLIT (98 units, $344).
# Fit a linear model on the casava_small data frame
model <- lm(DOLLAR_SALES ~ UNIT_SALES + CALORIC_SEGMENT + PACKAGE + CATEGORY + SEASON + REGION, data = casava_small)
summary(model)
##
## Call:
## lm(formula = DOLLAR_SALES ~ UNIT_SALES + CALORIC_SEGMENT + PACKAGE +
## CATEGORY + SEASON + REGION, data = casava_small)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3586.0 -33.7 -2.8 27.4 6441.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.231e+01 3.023e+01 1.730 0.083610 .
## UNIT_SALES 1.989e+00 4.773e-03 416.680 < 2e-16 ***
## CALORIC_SEGMENT 3.840e+01 2.881e+01 1.333 0.182514
## PACKAGE.5L 6ONE JUG -5.519e+00 6.895e+00 -0.800 0.423522
## PACKAGE.5L MULTI JUG -7.090e+01 1.038e+02 -0.683 0.494725
## PACKAGE12SMALL 12ONE CUP 2.229e+02 8.384e+00 26.590 < 2e-16 ***
## PACKAGE12SMALL 24ONE PLASTICS JUG 8.004e+01 1.462e+02 0.547 0.584147
## PACKAGE12SMALL 6ONE CUP -5.452e+01 3.446e+01 -1.582 0.113648
## PACKAGE12SMALL 6ONE MEDIUM CUP -3.456e+01 1.874e+01 -1.844 0.065143 .
## PACKAGE12SMALL 6ONE SHADYES JUG -1.113e+02 1.463e+02 -0.761 0.446865
## PACKAGE12SMALL 8ONE BUMPY CUP -4.476e-01 1.759e+01 -0.025 0.979704
## PACKAGE12SMALL 8ONE CUP 1.599e+02 2.521e+01 6.342 2.29e-10 ***
## PACKAGE12SMALL MLT MEDIUM CUP -4.479e+01 3.987e+01 -1.123 0.261293
## PACKAGE12SMALL MLT PLASTICS JUG -4.235e+01 9.134e+00 -4.637 3.55e-06 ***
## PACKAGE12SMALL MULTI CUP 3.735e+01 3.820e+01 0.978 0.328159
## PACKAGE16SMALL 24ONE CUP -1.149e+02 8.476e+01 -1.356 0.175140
## PACKAGE16SMALL MLT SHADYES JUG -4.473e+01 1.462e+02 -0.306 0.759722
## PACKAGE16SMALL MULTI CUP -9.059e+01 8.520e+00 -10.632 < 2e-16 ***
## PACKAGE18SMALL MULTI JUG -4.557e+01 5.189e+00 -8.784 < 2e-16 ***
## PACKAGE1L MULTI JUG -3.327e+01 1.259e+01 -2.643 0.008227 **
## PACKAGE20SMALL MULTI JUG -2.398e+01 8.341e+00 -2.875 0.004040 **
## PACKAGE24 - 25SMALL MULTI JUG -5.784e+01 6.641e+00 -8.710 < 2e-16 ***
## PACKAGE24SMALL MLT SHADYES JUG -4.853e+01 1.777e+01 -2.731 0.006318 **
## PACKAGE2L MULTI JUG -6.607e+01 8.359e+00 -7.904 2.76e-15 ***
## PACKAGE3L MULTI JUG -1.002e+02 3.892e+01 -2.575 0.010039 *
## PACKAGE7.5SMALL 6ONE CUP -2.666e+01 3.225e+01 -0.827 0.408391
## PACKAGEALL OTHER ONES 6.118e+01 2.519e+01 2.429 0.015140 *
## CATEGORYING ENHANCED WATER -2.870e+01 2.979e+01 -0.963 0.335324
## CATEGORYSPARKLING WATER -1.400e+02 1.660e+01 -8.435 < 2e-16 ***
## CATEGORYSSD -6.955e+01 3.379e+00 -20.580 < 2e-16 ***
## SEASONSPRING 1.956e+00 2.032e+00 0.963 0.335692
## SEASONSUMMER 3.807e+00 2.000e+00 1.903 0.056996 .
## SEASONWINTER -7.786e+00 2.089e+00 -3.727 0.000194 ***
## REGIONCALI_NEVADA 4.253e+00 4.093e+00 1.039 0.298752
## REGIONCOLORADO 2.871e+01 2.591e+00 11.082 < 2e-16 ***
## REGIONDESERT_SW 3.952e+00 2.998e+00 1.318 0.187461
## REGIONKANSAS 8.784e+01 5.536e+00 15.867 < 2e-16 ***
## REGIONMOUNTAIN 2.101e+01 2.793e+00 7.522 5.52e-14 ***
## REGIONNEWMEXICO 2.276e+00 3.579e+00 0.636 0.524778
## REGIONNOCAL 7.957e-01 3.733e+00 0.213 0.831219
## REGIONNORTHERN 2.658e+01 2.116e+00 12.563 < 2e-16 ***
## REGIONPRAIRIE 8.415e+00 4.522e+00 1.861 0.062764 .
## REGIONSOCAL 3.299e+01 2.834e+00 11.641 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 146 on 42255 degrees of freedom
## Multiple R-squared: 0.8498, Adjusted R-squared: 0.8497
## F-statistic: 5694 on 42 and 42255 DF, p-value: < 2.2e-16
Our casava_small model has a lower R² of 0.85, but it also draws on far more data: nearly 42K observations versus roughly 5,000 in our innovation df. There are many significant features, but nothing that moves the needle in a huge way.
rm(list = ls())
#head(casava_small)
DIET MOONLIT has pretty decent sales at 69th place in total revenue. Casava is not the sexiest flavor in town, but with our innovation data frame the R² is quite high (although it is based on the regular caloric segment and no specific package type). Some weeks look great for a 6-month run; it's just a matter of deciding which ones to use.
# Load and prepare dataset
df <- read.csv("casava_one_hot.csv")
# Load and prepare dataset
#str(df)
df <- df %>%
#select(-DATE, -MONTH, -SEASON, -BRAND, -REGION, -ITEM )
select(-DATE, -MONTH, -SEASON)
# Summarize the dataset
skimr::skim(df)
| Name | df |
| Number of rows | 7968 |
| Number of columns | 46 |
| _______________________ | |
| Column type frequency: | |
| numeric | 46 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| CALORIC_SEGMENT | 0 | 1 | 0.65 | 0.48 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | ▅▁▁▁▇ |
| UNIT_SALES | 0 | 1 | 82.74 | 194.92 | 1.00 | 18.00 | 46.00 | 97.00 | 4718.00 | ▇▁▁▁▁ |
| DOLLAR_SALES | 0 | 1 | 165.06 | 374.63 | 0.25 | 33.23 | 83.97 | 184.78 | 10806.82 | ▇▁▁▁▁ |
| ENERGY_DRINK | 0 | 1 | 0.64 | 0.48 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | ▅▁▁▁▇ |
| NO_ARTIFICIAL_SWEETNERS | 0 | 1 | 0.06 | 0.24 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| X16SMALL.24ONE.CUP | 0 | 1 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| X16SMALL.MULTI.CUP | 0 | 1 | 0.64 | 0.48 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | ▅▁▁▁▇ |
| X12SMALL.MULTI.CUP | 0 | 1 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| X2L.MULTI.JUG | 0 | 1 | 0.36 | 0.48 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▅ |
| SUNSET | 0 | 1 | 0.57 | 0.50 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | ▆▁▁▁▇ |
| BLAST | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ▁▁▇▁▁ |
| JUICED | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ▁▁▇▁▁ |
| GUAVA | 0 | 1 | 0.21 | 0.41 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
| RECOVERY | 0 | 1 | 0.13 | 0.33 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| JACK | 0 | 1 | 0.13 | 0.33 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| RESERVE | 0 | 1 | 0.23 | 0.42 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
| WHITE | 0 | 1 | 0.23 | 0.42 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
| PITAYA | 0 | 1 | 0.06 | 0.24 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| ED | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ▁▁▇▁▁ |
| CASAVA | 0 | 1 | 0.65 | 0.48 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | ▅▁▁▁▇ |
| BRAND_DIET.MOONLIT | 0 | 1 | 0.35 | 0.48 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▅ |
| BRAND_HILL.MOISTURE.JUMPIN.FISH | 0 | 1 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| BRAND_MOONLIT | 0 | 1 | 0.01 | 0.10 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| BRAND_MYTHICAL.BEVERAGE | 0 | 1 | 0.23 | 0.42 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
| BRAND_SUPER.DUPER.JUICED | 0 | 1 | 0.21 | 0.41 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
| BRAND_SUPER.DUPER.PUNCHED | 0 | 1 | 0.06 | 0.24 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| BRAND_SUPER.DUPER.RECOVERY | 0 | 1 | 0.13 | 0.33 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| WEEK_OF_YEAR | 0 | 1 | 26.89 | 15.14 | 1.00 | 14.00 | 27.00 | 40.00 | 53.00 | ▇▇▇▇▇ |
| REGION_ARIZONA | 0 | 1 | 0.22 | 0.41 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
| REGION_CALI_NEVADA | 0 | 1 | 0.04 | 0.18 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| REGION_COLORADO | 0 | 1 | 0.12 | 0.32 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| REGION_DESERT_SW | 0 | 1 | 0.07 | 0.25 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| REGION_KANSAS | 0 | 1 | 0.02 | 0.13 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| REGION_MOUNTAIN | 0 | 1 | 0.10 | 0.30 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| REGION_NEWMEXICO | 0 | 1 | 0.04 | 0.20 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| REGION_NOCAL | 0 | 1 | 0.04 | 0.19 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| REGION_NORTHERN | 0 | 1 | 0.26 | 0.44 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▃ |
| REGION_PRAIRIE | 0 | 1 | 0.02 | 0.16 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| REGION_SOCAL | 0 | 1 | 0.08 | 0.27 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| ITEM_JUMPIN.FISH..CASAVA.JACK | 0 | 1 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| ITEM_MOONLIT..CASAVA | 0 | 1 | 0.01 | 0.10 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| ITEM_MOONLIT..SUNSET | 0 | 1 | 0.35 | 0.48 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▅ |
| ITEM_MYTHICAL.BEVERAGE.RESERVE..WHITE.CASAVA | 0 | 1 | 0.23 | 0.42 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
| ITEM_SUPER.DUPER.JUIC..CASAVA.SUNSET.GUAVA | 0 | 1 | 0.21 | 0.41 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
| ITEM_SUPER.DUPER.PITAYA…CASAVA. | 0 | 1 | 0.06 | 0.24 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| ITEM_SUPER.DUPER.RECOVERY..CASAVA.JACK | 0 | 1 | 0.13 | 0.33 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
#remove top one percent of unit sales to clean up outliers
df <- df %>%
filter(UNIT_SALES < quantile(UNIT_SALES, 0.99))
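It is cheap to log how aggressive the trim actually was. One way to do this, a sketch meant to run before applying the filter above (on the 7,968-row one-hot df, the 99th-percentile cut should drop roughly 1% of rows):

```r
# Report the outlier cutoff and how many rows the filter will remove
cutoff  <- quantile(df$UNIT_SALES, 0.99)
dropped <- sum(df$UNIT_SALES >= cutoff)
cat("cutoff:", cutoff, "- rows to drop:", dropped,
    "of", nrow(df), "\n")
```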
# Split the data
set.seed(123)
df_testtrn <- initial_split(df, prop = 0.8, strata = UNIT_SALES)
Train <- training(df_testtrn)
Test <- testing(df_testtrn)
# Prepare features and labels for XGBoost
train_features <- Train[, -which(names(Train) == "UNIT_SALES")]
train_labels <- Train$UNIT_SALES
test_features <- Test[, -which(names(Test) == "UNIT_SALES")]
test_labels <- Test$UNIT_SALES
# Convert data to DMatrix format
dtrain <- xgb.DMatrix(data = as.matrix(train_features), label = train_labels)
dtest <- xgb.DMatrix(data = as.matrix(test_features), label = test_labels)
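Before tuning, it helps to anchor the CV RMSE against a naive baseline. A minimal sketch, assuming `train_labels` and `test_labels` from above (predicting the training-set mean for every test row):

```r
# Naive baseline: predict the training mean for every test observation
baseline_pred <- mean(train_labels)
baseline_rmse <- sqrt(mean((test_labels - baseline_pred)^2))
cat("Baseline RMSE:", baseline_rmse, "\n")
```

If the boosted model's test RMSE is not well below this number, the features carry little signal; the cross-validation trace below should fall comfortably under it within the first few dozen rounds.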
# Define XGBoost parameters
set.seed(123)
params <- list(
booster = "gbtree",
objective = "reg:squarederror",
eval_metric = "rmse",
eta = 0.05,
max_depth = 4,
min_child_weight = 3,
subsample = 0.7,
colsample_bytree = 0.6,
lambda = 1,
alpha = 1
)
# Perform cross-validation to find the optimal number of boosting rounds
cv_results <- xgb.cv(
params = params,
data = dtrain,
nfold = 5,
nrounds = 500,
early_stopping_rounds = 10,
metrics = "rmse",
seed = 123
)
## [1] train-rmse:94.423501+0.564879 test-rmse:94.429789+2.236629
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
##
## [2] train-rmse:89.908282+0.530045 test-rmse:89.939602+2.123119
## [3] train-rmse:86.177544+0.716596 test-rmse:86.218138+2.130073
## [4] train-rmse:82.613269+0.737682 test-rmse:82.644481+1.576368
## [5] train-rmse:79.238916+0.526982 test-rmse:79.269607+1.603306
## [6] train-rmse:76.271102+0.915397 test-rmse:76.314502+0.952345
## [7] train-rmse:73.478895+0.952286 test-rmse:73.539239+1.368307
## [8] train-rmse:70.289821+0.944773 test-rmse:70.355578+1.755051
## [9] train-rmse:67.536899+1.243970 test-rmse:67.607092+2.042894
## [10] train-rmse:64.652668+1.422891 test-rmse:64.721507+1.871742
## [11] train-rmse:62.350805+0.985078 test-rmse:62.426439+1.421181
## [12] train-rmse:60.192237+0.662183 test-rmse:60.282419+1.696546
## [13] train-rmse:58.136834+0.661374 test-rmse:58.230704+1.145594
## [14] train-rmse:55.699524+0.548763 test-rmse:55.817174+1.092142
## [15] train-rmse:53.850068+0.699961 test-rmse:53.981021+1.042214
## [16] train-rmse:51.876980+0.414495 test-rmse:52.023382+1.347430
## [17] train-rmse:50.197896+0.838561 test-rmse:50.359869+1.619395
## [18] train-rmse:48.591621+0.586041 test-rmse:48.743114+1.627147
## [19] train-rmse:47.064947+0.536198 test-rmse:47.220069+1.278363
## [20] train-rmse:45.413550+0.790492 test-rmse:45.577684+0.889518
## [21] train-rmse:43.840890+0.532687 test-rmse:44.020071+1.107817
## [22] train-rmse:42.135188+0.565035 test-rmse:42.302897+0.882858
## [23] train-rmse:40.703137+0.725647 test-rmse:40.897130+0.865613
## [24] train-rmse:39.545894+0.599196 test-rmse:39.757911+0.781891
## [25] train-rmse:38.234973+0.148406 test-rmse:38.454664+0.909886
## [26] train-rmse:36.826687+0.478578 test-rmse:37.064830+0.773145
## [27] train-rmse:35.842496+0.740995 test-rmse:36.087225+0.996049
## [28] train-rmse:34.742250+1.092850 test-rmse:34.998586+1.284910
## [29] train-rmse:33.843980+1.002732 test-rmse:34.121626+1.493937
## [30] train-rmse:32.624164+0.766119 test-rmse:32.917330+1.178317
## [31] train-rmse:31.647128+0.898285 test-rmse:31.955950+0.935267
## [32] train-rmse:30.891688+1.114189 test-rmse:31.213571+0.767665
## [33] train-rmse:30.186221+1.379518 test-rmse:30.519200+0.958765
## [34] train-rmse:29.523203+1.616932 test-rmse:29.862204+0.974877
## [35] train-rmse:28.538163+1.539523 test-rmse:28.892079+0.900026
## [36] train-rmse:27.762326+1.498970 test-rmse:28.127804+0.885392
## [37] train-rmse:26.994302+1.267090 test-rmse:27.389614+0.669507
## [38] train-rmse:26.305485+1.383793 test-rmse:26.719281+0.894376
## [39] train-rmse:25.663855+1.567955 test-rmse:26.100658+1.033330
## [40] train-rmse:25.048856+1.714990 test-rmse:25.485624+1.259763
## [41] train-rmse:24.342306+1.873567 test-rmse:24.796985+1.385206
## [42] train-rmse:23.923081+2.027080 test-rmse:24.385598+1.516420
## [43] train-rmse:23.409364+2.207352 test-rmse:23.885990+1.741626
## [44] train-rmse:22.746796+2.066117 test-rmse:23.246128+1.560959
## [45] train-rmse:22.091585+1.837371 test-rmse:22.607085+1.386395
## [46] train-rmse:21.731753+1.905501 test-rmse:22.256820+1.545300
## [47] train-rmse:21.377451+1.948375 test-rmse:21.922120+1.552681
## [48] train-rmse:20.725071+1.840607 test-rmse:21.290284+1.465257
## [49] train-rmse:20.191422+1.664769 test-rmse:20.775562+1.353613
## [50] train-rmse:19.993049+1.694086 test-rmse:20.591168+1.431328
## [51] train-rmse:19.532708+1.613293 test-rmse:20.153989+1.417581
## [52] train-rmse:19.310361+1.405906 test-rmse:19.949748+1.257371
## [53] train-rmse:18.791059+1.326396 test-rmse:19.443868+1.202560
## [54] train-rmse:18.366828+1.180110 test-rmse:19.040862+1.135314
## [55] train-rmse:18.037199+1.037793 test-rmse:18.735318+1.026703
## [56] train-rmse:17.807785+0.942841 test-rmse:18.516456+0.916915
## [57] train-rmse:17.504379+0.830789 test-rmse:18.224838+0.734766
## [58] train-rmse:17.247465+0.936421 test-rmse:17.963856+0.731252
## [59] train-rmse:16.861073+0.884173 test-rmse:17.590704+0.694104
## [60] train-rmse:16.503434+0.843735 test-rmse:17.236142+0.677478
## [61] train-rmse:16.270743+0.773643 test-rmse:17.024095+0.661509
## [62] train-rmse:15.946012+0.730867 test-rmse:16.708869+0.640579
## [63] train-rmse:15.690874+0.639437 test-rmse:16.462341+0.628947
## [64] train-rmse:15.550946+0.664645 test-rmse:16.329943+0.633535
## [65] train-rmse:15.395232+0.555433 test-rmse:16.186896+0.532441
## [66] train-rmse:15.186364+0.609705 test-rmse:15.983035+0.567788
## [67] train-rmse:14.934099+0.580698 test-rmse:15.739795+0.550844
## [68] train-rmse:14.859255+0.596816 test-rmse:15.671315+0.517718
## [69] train-rmse:14.712069+0.624446 test-rmse:15.533829+0.588984
## [70] train-rmse:14.609424+0.674122 test-rmse:15.440851+0.545555
## [71] train-rmse:14.431689+0.652090 test-rmse:15.270005+0.493945
## ... (iterations [72]-[386] omitted: test-rmse declines steadily from ~15.1 to ~11.56, then plateaus)
## [387] train-rmse:9.466473+0.109195 test-rmse:11.562778+0.305902
## Stopping. Best iteration:
## [377] train-rmse:9.498560+0.111276 test-rmse:11.560506+0.304451
best_nrounds <- cv_results$best_iteration
# Train the final model using the best number of rounds found
model_xgb <- xgb.train(
  params = params,
  data = dtrain,
  nrounds = best_nrounds
)
# Make predictions and evaluate the model
train_pred <- predict(model_xgb, dtrain)
test_pred <- predict(model_xgb, dtest)
train_rmse <- sqrt(mean((train_labels - train_pred)^2))
test_rmse <- sqrt(mean((test_labels - test_pred)^2))
# Calculate R-squared for the training set
sst_train <- sum((train_labels - mean(train_labels)) ^ 2)
ssr_train <- sum((train_labels - train_pred) ^ 2)
r_squared_train <- 1 - (ssr_train / sst_train)
# Calculate R-squared for the test set
sst_test <- sum((test_labels - mean(test_labels)) ^ 2)
ssr_test <- sum((test_labels - test_pred) ^ 2)
r_squared_test <- 1 - (ssr_test / sst_test)
# Note: MAPE is undefined for observations where the actual value is zero
train_mape <- mean(abs((train_labels - train_pred) / train_labels)) * 100
test_mape <- mean(abs((test_labels - test_pred) / test_labels)) * 100
train_mae <- mean(abs(train_labels - train_pred))
test_mae <- mean(abs(test_labels - test_pred))
cat("Model Performance Metrics:\n",
    "--------------------------\n",
    "Training RMSE: ", train_rmse, "\n",
    "Test RMSE: ", test_rmse, "\n",
    "Training R-squared: ", r_squared_train, "\n",
    "Test R-squared: ", r_squared_test, "\n",
    "Training MAE: ", train_mae, "\n",
    "Test MAE: ", test_mae, "\n",
    "Training MAPE: ", train_mape, "%\n",
    "Test MAPE: ", test_mape, "%\n", sep="")
## Model Performance Metrics:
## --------------------------
## Training RMSE: 9.726129
## Test RMSE: 11.42459
## Training R-squared: 0.9812686
## Test R-squared: 0.9750236
## Training MAE: 6.503416
## Test MAE: 7.239086
## Training MAPE: 20.10542%
## Test MAPE: 23.55032%
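The metric formulas used above can be sanity-checked on a toy vector; this is a minimal base-R sketch with made-up numbers (not model output):

```r
# Toy actuals and predictions (illustrative values only)
actual <- c(10, 20, 30, 40, 50)
pred   <- c(12, 18, 33, 37, 55)

rmse <- sqrt(mean((actual - pred)^2))              # root mean squared error
mae  <- mean(abs(actual - pred))                   # mean absolute error
mape <- mean(abs((actual - pred) / actual)) * 100  # undefined if any actual == 0
r2   <- 1 - sum((actual - pred)^2) / sum((actual - mean(actual))^2)

round(c(RMSE = rmse, MAE = mae, MAPE = mape, R2 = r2), 4)
```

Working through the errors (-2, 2, -3, 3, -5) by hand reproduces RMSE ≈ 3.19, MAE = 3, MAPE = 11.5%, and R² = 0.949, matching the formulas used in the chunk above.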
# Build a residuals data frame:
# 'train_labels'/'test_labels' hold the actual values and
# 'train_pred'/'test_pred' hold the model's predictions
residuals_train <- train_labels - train_pred
residuals_test <- test_labels - test_pred
residuals_data <- data.frame(
  Residuals = c(residuals_train, residuals_test),
  Dataset = c(rep('Training', length(residuals_train)), rep('Test', length(residuals_test)))
)
# Plot the residual distributions for the training and test sets
ggplot(residuals_data, aes(x = Residuals, fill = Dataset)) +
  geom_histogram(binwidth = 1, position = 'identity', alpha = 0.6) +
  facet_wrap(~ Dataset) +
  ggtitle('Residuals Distribution')
# Prepare actual vs. predicted data for both sets
actual_pred_data <- data.frame(
  Actual = c(train_labels, test_labels),
  Predicted = c(train_pred, test_pred),
  Dataset = c(rep('Training', length(train_labels)), rep('Test', length(test_labels)))
)
# Plotting Actual vs. Predicted Values
ggplot(actual_pred_data, aes(x = Actual, y = Predicted, colour = Dataset)) +
  geom_point(alpha = 0.6) +
  geom_abline(intercept = 0, slope = 1, linetype = 'dashed', color = 'red') +
  xlab('Actual Values') +
  ylab('Predicted Values') +
  scale_colour_manual(values = c('Training' = 'blue', 'Test' = 'red')) +
  ggtitle('Actual vs. Predicted Values')
library(xgboost)
# Calculate feature importance
importance_matrix <- xgb.importance(feature_names = colnames(train_features), model = model_xgb)
# View the feature importance scores
print(importance_matrix)
## Feature Gain Cover
## 1: DOLLAR_SALES 7.492052e-01 0.4515138155
## 2: RESERVE 9.069002e-02 0.0647068095
## 3: WHITE 5.722103e-02 0.0265604428
## 4: BRAND_MYTHICAL.BEVERAGE 4.360819e-02 0.0128073188
## 5: WEEK_OF_YEAR 8.122528e-03 0.1212281935
## 6: REGION_PRAIRIE 7.327083e-03 0.0178425217
## 7: REGION_COLORADO 7.013103e-03 0.0145408679
## 8: REGION_MOUNTAIN 4.925610e-03 0.0223640637
## 9: REGION_NORTHERN 4.844435e-03 0.0276914310
## 10: SUNSET 4.108095e-03 0.0136490055
## 11: CALORIC_SEGMENT 3.878993e-03 0.0273677749
## 12: REGION_NOCAL 2.564525e-03 0.0142378547
## 13: REGION_ARIZONA 2.141366e-03 0.0186326765
## 14: ITEM_MYTHICAL.BEVERAGE.RESERVE..WHITE.CASAVA 1.976363e-03 0.0024233524
## 15: CASAVA 1.922049e-03 0.0090394677
## 16: BRAND_DIET.MOONLIT 1.842857e-03 0.0048839223
## 17: NO_ARTIFICIAL_SWEETNERS 1.445095e-03 0.0160074158
## 18: REGION_SOCAL 1.022467e-03 0.0083460480
## 19: GUAVA 9.304661e-04 0.0117596555
## 20: REGION_KANSAS 8.220668e-04 0.0132108342
## 21: REGION_NEWMEXICO 7.997970e-04 0.0224282525
## 22: REGION_DESERT_SW 6.937068e-04 0.0150799934
## 23: BRAND_SUPER.DUPER.JUICED 6.924043e-04 0.0056749812
## 24: RECOVERY 4.225686e-04 0.0027870888
## 25: ENERGY_DRINK 3.839939e-04 0.0022276218
## 26: REGION_CALI_NEVADA 2.648522e-04 0.0135579961
## 27: X16SMALL.MULTI.CUP 2.439811e-04 0.0060867838
## 28: ITEM_SUPER.DUPER.PITAYA...CASAVA. 2.349725e-04 0.0005249618
## 29: PITAYA 2.272659e-04 0.0081366058
## 30: ITEM_SUPER.DUPER.JUIC..CASAVA.SUNSET.GUAVA 1.105918e-04 0.0027785002
## 31: JACK 8.119516e-05 0.0036155158
## 32: BRAND_SUPER.DUPER.PUNCHED 7.506285e-05 0.0052417823
## 33: BRAND_MOONLIT 6.332375e-05 0.0071088319
## 34: X2L.MULTI.JUG 3.444163e-05 0.0010898531
## 35: BRAND_SUPER.DUPER.RECOVERY 3.389506e-05 0.0011487682
## 36: ITEM_MOONLIT..CASAVA 1.471366e-05 0.0028583595
## 37: ITEM_MOONLIT..SUNSET 1.170011e-05 0.0008406319
## Feature Gain Cover
## Frequency
## 1: 0.3458882611
## 2: 0.0539861896
## 3: 0.0253191044
## 4: 0.0108809374
## 5: 0.1895794099
## 6: 0.0238543628
## 7: 0.0282485876
## 8: 0.0301318267
## 9: 0.0307595731
## 10: 0.0182046453
## 11: 0.0278300900
## 12: 0.0171584013
## 13: 0.0288763340
## 14: 0.0018832392
## 15: 0.0058589663
## 16: 0.0041849759
## 17: 0.0152751622
## 18: 0.0131826742
## 19: 0.0142289182
## 20: 0.0133919230
## 21: 0.0161121574
## 22: 0.0125549278
## 23: 0.0069052103
## 24: 0.0094161959
## 25: 0.0096254447
## 26: 0.0106716886
## 27: 0.0050219711
## 28: 0.0008369952
## 29: 0.0050219711
## 30: 0.0039757271
## 31: 0.0056497175
## 32: 0.0037664783
## 33: 0.0054404687
## 34: 0.0010462440
## 35: 0.0023017368
## 36: 0.0023017368
## 37: 0.0006277464
## Frequency
# Plot the feature importance
xgb.plot.importance(importance_matrix = importance_matrix)
# Compute partial dependence data for 'DOLLAR_SALES', 'CASAVA', 'CALORIC_SEGMENT', and 'ENERGY_DRINK'
# pd <- partial(model_xgb, pred.var = c("DOLLAR_SALES", "CASAVA", "CALORIC_SEGMENT", "ENERGY_DRINK"), train = train_features, grid.resolution = 20)
#
# # Default PDP
# pdp1 <- plotPartial(pd, plot = TRUE)
#
# # Add contour lines and use a different color palette
# rwb <- colorRampPalette(c("red", "white", "blue"))
# pdp2 <- plotPartial(pd, contour = TRUE, col.regions = rwb)
#
# # 3-D surface
# pdp3 <- plotPartial(pd, levelplot = FALSE, zlab = "Predicted Outcome", drape = TRUE, colorkey = TRUE, screen = list(z = -20, x = -60))
#
# # Combine plots into one window
# grid.arrange(pdp1, pdp2, pdp3, ncol = 3)
Model without the DOLLAR_SALES Variable
# Assuming 'df' is your complete dataframe and 'UNIT_SALES' is your target variable
df2 <- df
# Remove DOLLAR_SALES from the features
df2$DOLLAR_SALES <- NULL
# Split the updated data into training and testing sets (same approach as before)
set.seed(123)
df2_testtrn <- initial_split(df2, prop = 0.8, strata = UNIT_SALES)
Train <- training(df2_testtrn)
Test <- testing(df2_testtrn)
# Prepare features and labels for XGBoost, excluding DOLLAR_SALES
train_features <- Train[, -which(names(Train) == "UNIT_SALES")]
train_labels <- Train$UNIT_SALES
test_features <- Test[, -which(names(Test) == "UNIT_SALES")]
test_labels <- Test$UNIT_SALES
# Convert data to DMatrix format for XGBoost
dtrain <- xgb.DMatrix(data = as.matrix(train_features), label = train_labels)
dtest <- xgb.DMatrix(data = as.matrix(test_features), label = test_labels)
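A note on the DMatrix conversion: `as.matrix()` only yields a numeric matrix if every column is already numeric; any factor or character column coerces the whole matrix to character. The importance tables suggest the dummy columns (e.g. `REGION_*`) were created upstream in this project, but for completeness, here is a minimal base-R sketch (toy data, hypothetical column names) of one-hot encoding with `model.matrix()` before building the DMatrix:

```r
# Toy data frame with one numeric and one factor column (illustrative names)
toy <- data.frame(
  WEEK_OF_YEAR = c(1, 2, 3, 4),
  REGION       = factor(c("NORTHERN", "SOCAL", "NORTHERN", "PRAIRIE"))
)
# "~ . - 1" expands the factor into dummy columns and drops the intercept,
# producing a fully numeric matrix suitable for xgb.DMatrix()
X <- model.matrix(~ . - 1, data = toy)
colnames(X)
```

With no intercept, `model.matrix()` expands the first factor into one dummy per level, so `X` here has four numeric columns: the week plus one indicator per region.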
# Reuse 'params' and 'best_nrounds' from the earlier run; ideally, xgb.cv would be
# re-run after dropping DOLLAR_SALES, since the optimal number of rounds may differ
# Train the final model without DOLLAR_SALES
model_xgb_no_dollar_sales <- xgb.train(
  params = params,
  data = dtrain,
  nrounds = best_nrounds
)
# Make predictions and evaluate the model
train_pred <- predict(model_xgb_no_dollar_sales, dtrain)
test_pred <- predict(model_xgb_no_dollar_sales, dtest)
train_rmse <- sqrt(mean((train_labels - train_pred)^2))
test_rmse <- sqrt(mean((test_labels - test_pred)^2))
# Calculate R-squared for the training set
sst_train <- sum((train_labels - mean(train_labels)) ^ 2)
ssr_train <- sum((train_labels - train_pred) ^ 2)
r_squared_train <- 1 - (ssr_train / sst_train)
# Calculate R-squared for the test set
sst_test <- sum((test_labels - mean(test_labels)) ^ 2)
ssr_test <- sum((test_labels - test_pred) ^ 2)
r_squared_test <- 1 - (ssr_test / sst_test)
train_mape <- mean(abs((train_labels - train_pred) / train_labels)) * 100
test_mape <- mean(abs((test_labels - test_pred) / test_labels)) * 100
train_mae <- mean(abs(train_labels - train_pred))
test_mae <- mean(abs(test_labels - test_pred))
cat("Model Performance Metrics:\n",
    "--------------------------\n",
    "Training RMSE: ", train_rmse, "\n",
    "Test RMSE: ", test_rmse, "\n",
    "Training R-squared: ", r_squared_train, "\n",
    "Test R-squared: ", r_squared_test, "\n",
    "Training MAE: ", train_mae, "\n",
    "Test MAE: ", test_mae, "\n",
    "Training MAPE: ", train_mape, "%\n",
    "Test MAPE: ", test_mape, "%\n", sep="")
## Model Performance Metrics:
## --------------------------
## Training RMSE: 51.03556
## Test RMSE: 57.0461
## Training R-squared: 0.4842549
## Test R-squared: 0.3772679
## Training MAE: 34.92517
## Test MAE: 37.98006
## Training MAPE: 228.2699%
## Test MAPE: 274.0136%
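To make the contrast between the two runs explicit, the test metrics can be collected into a small comparison table. The numbers below are transcribed (rounded) from the printed output above, so this is purely a presentation sketch:

```r
# Test-set metrics for the models with and without DOLLAR_SALES,
# copied from the two "Model Performance Metrics" printouts above
comparison <- data.frame(
  model     = c("with DOLLAR_SALES", "without DOLLAR_SALES"),
  test_rmse = c(11.42, 57.05),
  test_r2   = c(0.975, 0.377),
  test_mae  = c(7.24, 37.98)
)
# RMSE relative to the full model
comparison$rmse_ratio <- comparison$test_rmse / comparison$test_rmse[1]
comparison
```

Dropping DOLLAR_SALES roughly quintuples test RMSE and cuts test R-squared from ~0.98 to ~0.38, which suggests the original model leaned heavily on concurrent revenue — a variable that would not be known in advance for a new product launch.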
# Calculate feature importance
importance_matrix2 <- xgb.importance(feature_names = colnames(train_features), model = model_xgb_no_dollar_sales)
# View the feature importance scores
print(importance_matrix2)
## Feature Gain Cover
## 1: RESERVE 0.3454217961 0.0694608014
## 2: WEEK_OF_YEAR 0.1207836505 0.2990349831
## 3: WHITE 0.0926169672 0.0233899482
## 4: REGION_COLORADO 0.0534639570 0.0365789726
## 5: BRAND_MYTHICAL.BEVERAGE 0.0422554479 0.0173343757
## 6: NO_ARTIFICIAL_SWEETNERS 0.0385900817 0.0202321984
## 7: REGION_PRAIRIE 0.0342124484 0.0430368565
## 8: REGION_MOUNTAIN 0.0340366041 0.0564988345
## 9: REGION_NORTHERN 0.0300446682 0.0421604359
## 10: SUNSET 0.0250295152 0.0164420504
## 11: RECOVERY 0.0214300745 0.0150953073
## 12: CALORIC_SEGMENT 0.0198704988 0.0232215107
## 13: REGION_NOCAL 0.0139207152 0.0264419702
## 14: REGION_ARIZONA 0.0127081104 0.0267797541
## 15: REGION_KANSAS 0.0123352008 0.0464927044
## 16: ITEM_MYTHICAL.BEVERAGE.RESERVE..WHITE.CASAVA 0.0122444139 0.0021805999
## 17: X16SMALL.MULTI.CUP 0.0113472983 0.0121597679
## 18: REGION_SOCAL 0.0093423100 0.0239332806
## 19: BRAND_SUPER.DUPER.RECOVERY 0.0081789856 0.0029252393
## 20: ENERGY_DRINK 0.0075478920 0.0088034375
## 21: CASAVA 0.0072047868 0.0082625287
## 22: REGION_DESERT_SW 0.0067948773 0.0371759262
## 23: GUAVA 0.0067516507 0.0232410507
## 24: REGION_NEWMEXICO 0.0063191294 0.0323737896
## 25: REGION_CALI_NEVADA 0.0055061894 0.0232621054
## 26: PITAYA 0.0042657907 0.0096262368
## 27: BRAND_DIET.MOONLIT 0.0042185355 0.0033749615
## 28: BRAND_SUPER.DUPER.JUICED 0.0035660876 0.0127109768
## 29: ITEM_MOONLIT..SUNSET 0.0025073756 0.0026621314
## 30: JACK 0.0022992594 0.0084698947
## 31: ITEM_SUPER.DUPER.PITAYA...CASAVA. 0.0010174978 0.0008600616
## 32: BRAND_SUPER.DUPER.PUNCHED 0.0009445124 0.0029850710
## 33: ITEM_SUPER.DUPER.JUIC..CASAVA.SUNSET.GUAVA 0.0009097770 0.0054184787
## 34: BRAND_MOONLIT 0.0008045329 0.0087837460
## 35: X2L.MULTI.JUG 0.0007361638 0.0019555116
## 36: ITEM_MOONLIT..CASAVA 0.0006138386 0.0062449134
## 37: ITEM_SUPER.DUPER.RECOVERY..CASAVA.JACK 0.0001593594 0.0003895876
## Feature Gain Cover
## Frequency
## 1: 0.055235658
## 2: 0.308381318
## 3: 0.020473448
## 4: 0.036468330
## 5: 0.015568351
## 6: 0.026444871
## 7: 0.030070377
## 8: 0.048197910
## 9: 0.046918319
## 10: 0.021113244
## 11: 0.030070377
## 12: 0.031349968
## 13: 0.027084666
## 14: 0.031776498
## 15: 0.030496908
## 16: 0.001706121
## 17: 0.012156110
## 18: 0.029857112
## 19: 0.004905097
## 20: 0.018767328
## 21: 0.009810194
## 22: 0.026658136
## 23: 0.025591811
## 24: 0.026658136
## 25: 0.017274472
## 26: 0.014075496
## 27: 0.002985711
## 28: 0.013222435
## 29: 0.002132651
## 30: 0.011942845
## 31: 0.002559181
## 32: 0.003198976
## 33: 0.004691832
## 34: 0.004265302
## 35: 0.003838772
## 36: 0.002559181
## 37: 0.001492856
## Frequency
xgb.plot.importance(importance_matrix = importance_matrix2)
if (!requireNamespace("pdp", quietly = TRUE)) install.packages("pdp")
if (!requireNamespace("xgboost", quietly = TRUE)) install.packages("xgboost")
library(pdp)
library(xgboost)
pdp::partial(model_xgb_no_dollar_sales, pred.var = "WEEK_OF_YEAR", train = train_features)
## WEEK_OF_YEAR yhat
## 1 1.00 76.95211
## 2 2.04 71.71872
## 3 3.08 69.83707
## 4 4.12 66.32345
## 5 5.16 60.81318
## 6 6.20 70.90967
## 7 7.24 68.43738
## 8 8.28 63.29930
## 9 9.32 65.10666
## 10 10.36 67.16075
## 11 11.40 65.23766
## 12 12.44 60.54461
## 13 13.48 62.33341
## 14 14.52 62.61749
## 15 15.56 63.04057
## 16 16.60 62.84779
## 17 17.64 65.46582
## 18 18.68 71.05545
## 19 19.72 74.71198
## 20 20.76 76.75920
## 21 21.80 76.67443
## 22 22.84 74.15679
## 23 23.88 72.66046
## 24 24.92 76.08584
## 25 25.96 76.45128
## 26 27.00 72.19917
## 27 28.04 65.85761
## 28 29.08 70.96944
## 29 30.12 64.56317
## 30 31.16 67.39512
## 31 32.20 78.84320
## 32 33.24 79.74898
## 33 34.28 79.04015
## 34 35.32 78.92823
## 35 36.36 77.01705
## 36 37.40 76.63151
## 37 38.44 74.05521
## 38 39.48 68.70417
## 39 40.52 54.57311
## 40 41.56 58.36970
## 41 42.60 57.34105
## 42 43.64 61.53831
## 43 44.68 65.07709
## 44 45.72 66.35301
## 45 46.76 70.20591
## 46 47.80 65.56802
## 47 48.84 64.82494
## 48 49.88 70.71167
## 49 50.92 67.44957
## 50 51.96 77.38542
## 51 53.00 62.14893
pd <- partial(model_xgb_no_dollar_sales, pred.var = "WEEK_OF_YEAR", train = train_features, grid.resolution = 20)
# Default PDP
pdp1 <- plotPartial(pd, plot = TRUE)
# plot (grid.arrange() comes from gridExtra)
if (!requireNamespace("gridExtra", quietly = TRUE)) install.packages("gridExtra")
library(gridExtra)
grid.arrange(pdp1)
Based on the Casava Energy Drink 2L MULTI JUG innovation data frame, we expect the best roughly six-month stretch to fall between about weeks 17 and 38.
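The "best window" read off the PDP curve can also be located programmatically instead of by eye. Below is a minimal base-R sketch, assuming `pd` is the data frame returned by `pdp::partial()` above (columns `WEEK_OF_YEAR` and `yhat`). Note that with `grid.resolution = 20` the grid points sit about 2.7 weeks apart, so either use the full-resolution `pd` or adjust `k` accordingly.

```r
# Find the k-point window with the highest mean partial dependence.
# Works on any numeric curve; no zoo dependency needed.
best_window <- function(yhat, weeks, k = 13) {
  roll <- sapply(seq_len(length(yhat) - k + 1),
                 function(i) mean(yhat[i:(i + k - 1)]))
  start <- which.max(roll)
  list(start_week = weeks[start],
       end_week   = weeks[start + k - 1],
       mean_yhat  = roll[start])
}
# e.g. best_window(pd$yhat, pd$WEEK_OF_YEAR)
```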
rm(list = setdiff(ls(), "df")) # clear the workspace but keep df, which is reused below
#skim(df) #same initial dataset used in previous models
# create a table of total values by brand
brand_summary <- df %>%
group_by(BRAND) %>%
summarise(
total_units_sold = sum(UNIT_SALES),
total_revenue = sum(DOLLAR_SALES),
avg_price = total_revenue / total_units_sold,
total_days_sold = n() # Count the number of rows for each brand
) %>%
arrange(desc(total_revenue)) %>% # Order by revenue in descending order
mutate(rank = row_number())
summary(brand_summary)
## BRAND total_units_sold total_revenue avg_price
## Length:288 Min. : 1 Min. : 1 Min. : 0.5315
## Class :character 1st Qu.: 2310 1st Qu.: 7563 1st Qu.: 2.0861
## Mode :character Median : 94691 Median : 266075 Median : 3.0291
## Mean : 1473003 Mean : 4989427 Mean : 3.2661
## 3rd Qu.: 651385 3rd Qu.: 2161764 3rd Qu.: 3.7252
## Max. :40414038 Max. :159387186 Max. :42.9378
## total_days_sold rank
## Min. : 1.0 Min. : 1.00
## 1st Qu.: 121.8 1st Qu.: 72.75
## Median : 1988.0 Median :144.50
## Mean : 8493.5 Mean :144.50
## 3rd Qu.: 8075.8 3rd Qu.:216.25
## Max. :124603.0 Max. :288.00
print(brand_summary[brand_summary$BRAND == "VENOMOUS BLAST", ])
## # A tibble: 1 × 6
## BRAND total_units_sold total_revenue avg_price total_days_sold rank
## <chr> <dbl> <dbl> <dbl> <int> <int>
## 1 VENOMOUS BLAST 360173 361370. 1.00 5188 130
VENOMOUS BLAST has a decent amount of sales, ranking 130th of 288 brands in total revenue. Surprisingly, it has both a low average price and a low total days sold.
# Filter the dataframe for only 'Venomous Blast'
filtered_df <- df %>%
filter(BRAND == "VENOMOUS BLAST")
summary(filtered_df)
## MARKET_KEY DATE CALORIC_SEGMENT CATEGORY
## Length:5188 Length:5188 Min. :0.0000 Length:5188
## Class :character Class :character 1st Qu.:0.0000 Class :character
## Mode :character Mode :character Median :1.0000 Mode :character
## Mean :0.7406
## 3rd Qu.:1.0000
## Max. :1.0000
## UNIT_SALES DOLLAR_SALES MANUFACTURER BRAND
## Min. : 1.00 Min. : 0.50 Length:5188 Length:5188
## 1st Qu.: 6.00 1st Qu.: 5.92 Class :character Class :character
## Median : 16.00 Median : 16.64 Mode :character Mode :character
## Mean : 69.42 Mean : 69.66
## 3rd Qu.: 41.00 3rd Qu.: 42.20
## Max. :3298.00 Max. :3199.67
## PACKAGE ITEM REGION MONTH
## Length:5188 Length:5188 Length:5188 Min. : 1.000
## Class :character Class :character Class :character 1st Qu.: 3.000
## Mode :character Mode :character Mode :character Median : 6.000
## Mean : 6.174
## 3rd Qu.: 9.000
## Max. :12.000
## SEASON
## Length:5188
## Class :character
## Mode :character
##
##
##
# Create the plot
ggplot(filtered_df, aes(x = UNIT_SALES, y = DOLLAR_SALES)) +
geom_point(color = "red", alpha = 1) + # Bright red points with full opacity
geom_smooth(method = "lm", color = "black", se = FALSE) + # Add linear regression line without confidence band
labs(title = "Linear Model of UNIT_SALES vs. DOLLAR_SALES for VENOMOUS BLAST",
x = "UNIT SALES",
y = "DOLLAR SALES") +
theme_minimal() +
theme(legend.position = "none")
## `geom_smooth()` using formula = 'y ~ x'
filtered_df %>%
mutate(DATE = as.Date(DATE)) %>%
mutate(WEEK = as.integer(format(DATE, "%U"))) %>%
group_by(WEEK) %>%
summarise(total_sales = sum(UNIT_SALES)) %>%
ggplot(aes(x = WEEK, y = total_sales)) +
geom_line(color = "black") + # Line connecting weekly totals
labs(title = "Total Sales by Week of the Year",
x = "Week of the Year",
y = "Total Unit Sales") +
theme_minimal()
> This shows that weekly sales of VENOMOUS BLAST are spread fairly
evenly across the year.
#find the best 13 weeks
library(zoo)
# Calculate total sales for each group of 13 consecutive weeks
sales_by_group <- filtered_df %>%
mutate(DATE = as.Date(DATE)) %>%
mutate(WEEK = as.integer(format(DATE, "%U"))) %>%
group_by(WEEK) %>%
summarise(total_sales = sum(UNIT_SALES)) %>%
mutate(sales_in_group = rollsum(total_sales, 13, align = "left", fill = NA)) %>%
mutate(week_label = paste0("Week ", WEEK + 1, " to Week ", WEEK + 13)) %>%
arrange(WEEK) %>% # Order by WEEK
filter(!is.na(sales_in_group)) # Remove rows with sales_in_group = NA
# Plot the bar chart
sales_by_group$week_label <- factor(sales_by_group$week_label, levels = sales_by_group$week_label[order(sales_by_group$WEEK)])
ggplot(sales_by_group, aes(x = factor(week_label), y = sales_in_group)) +
geom_bar(stat = "identity", fill = "black") +
labs(title = "Total Sales for Each 13-Week Grouping",
x = "Weeks (Starting from Week 1)",
y = "Total Sales") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
> From this graph we see that weeks 24 to 36 historically have the
highest unit sales of VENOMOUS BLAST.
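One caveat with `rollsum(..., fill = NA)` above: any 13-week window that wraps past the end of the year into January is dropped by the NA filter. A circular rolling sum, sketched below in base R, evaluates those windows as well. Here `x` is assumed to be a vector of weekly totals ordered by week of year.

```r
# Rolling sum of k consecutive weeks, wrapping around the year boundary,
# so windows like week 48 through week 8 are also considered.
circular_rollsum <- function(x, k = 13) {
  ext <- c(x, x[seq_len(k - 1)])  # append the first k-1 weeks to the end
  sapply(seq_along(x), function(i) sum(ext[i:(i + k - 1)]))
}
# best starting week: which.max(circular_rollsum(weekly$total_sales))
```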
#find the best 13 weeks for Kiwano sales
# Calculate total sales for each group of 13 consecutive weeks
sales_by_kiwano <- df %>%
filter(str_detect(ITEM, "KIWANO")) %>%
mutate(DATE = as.Date(DATE)) %>%
mutate(WEEK = as.integer(format(DATE, "%U"))) %>%
group_by(WEEK) %>%
summarise(total_sales = sum(UNIT_SALES)) %>%
mutate(sales_in_group = rollsum(total_sales, 13, align = "left", fill = NA)) %>%
mutate(week_label = paste0("Week ", WEEK + 1, " to Week ", WEEK + 13)) %>%
arrange(WEEK) %>% # Order by WEEK
filter(!is.na(sales_in_group)) # Remove rows with sales_in_group = NA
# Plot the bar chart
sales_by_kiwano$week_label <- factor(sales_by_kiwano$week_label, levels = sales_by_kiwano$week_label[order(sales_by_kiwano$WEEK)])
ggplot(sales_by_kiwano, aes(x = factor(week_label), y = sales_in_group)) +
geom_bar(stat = "identity", fill = "black") +
labs(title = "Total Sales for Each 13-Week Grouping",
x = "Weeks (Starting from Week 1)",
y = "Total Sales") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
> This graph shows that the best 13-week window for sales of any Kiwano
drink is weeks 19 to 31.
#find the best 13 weeks for Kiwano-flavored energy drinks in 16 oz packaging
# Calculate total sales for each group of 13 consecutive weeks
sales_by_energy <- df %>%
filter(CATEGORY == "ENERGY",
str_detect(ITEM, "KIWANO"),
str_detect(PACKAGE, "16")) %>%
mutate(DATE = as.Date(DATE)) %>%
mutate(WEEK = as.integer(format(DATE, "%U"))) %>%
group_by(WEEK) %>%
summarise(total_sales = sum(UNIT_SALES)) %>%
mutate(sales_in_group = rollsum(total_sales, 13, align = "left", fill = NA)) %>%
mutate(week_label = paste0("Week ", WEEK + 1, " to Week ", WEEK + 13)) %>%
arrange(WEEK) %>% # Order by WEEK
filter(!is.na(sales_in_group)) # Remove rows with sales_in_group = NA
# Plot the bar chart
sales_by_energy$week_label <- factor(sales_by_energy$week_label, levels = sales_by_energy$week_label[order(sales_by_energy$WEEK)])
ggplot(sales_by_energy, aes(x = factor(week_label), y = sales_in_group)) +
geom_bar(stat = "identity", fill = "black") +
labs(title = "Total Sales for Each 13-Week Grouping",
x = "Weeks (Starting from Week 1)",
y = "Total Sales") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
> This graph shows that the best 13-week window for sales of energy
drinks with Kiwano flavors in 16 oz packaging is weeks 12 to 24.
innovation <- df %>%
filter(CATEGORY == "ENERGY",
CALORIC_SEGMENT == 0,
str_detect(ITEM, "KIWANO"),
str_detect(PACKAGE, "16"))
print(unique(innovation$ITEM))
## [1] "MYTHICAL BEVERAGE ULTRA KIWANO ENERGY DRINK UNFLAVORED ZERO SUGAR CUP 16 LIQUID SMALL X4"
## [2] "SUPER-DUPER PURE ZERO ENERGY DRINK KIWANO KEKE SUGAR FREE CUP 16 LIQUID SMALL"
## [3] "RAINING JUMPIN-FISH GAME FUEL ZERO ENERGY DRINK CHARGED KIWANO SHOCK ZERO SUGAR CUP 16 LIQUID SMALL"
## [4] "MYTHICAL BEVERAGE ULTRA KIWANO ENERGY DRINK UNFLAVORED ZERO SUGAR CUP 16 LIQUID SMALL"
## [5] "SUPER-DUPER PURE ZERO ENERGY DRINK KIWANO ZERO SUGAR CUP 16 LIQUID SMALL"
## [6] "MYTHICAL BEVERAGE ULTRA KIWANO ENERGY DRINK UNFLAVORED ZERO SUGAR CUP 16 LIQUID SMALL X24"
## [7] "VENOMOUS BLAST ENERGY DRINK KIWANO DURIAN CUP 16 LIQUID SMALL"
## [8] "POW-POW GENTLE DRINK WYLDIN KIWANO CUP 16 LIQUID SMALL X12"
## [9] "MYTHICAL BEVERAGE REHAB ENERGY DRINK KIWANO CUP 15.5 LIQUID SMALL X24"
## [10] "SUPER-DUPER PURE ZERO ENERGY DRINK KIWANO ZERO SUGAR CUP 16 LIQUID SMALL X24"
#there are 10 zero-calorie kiwano energy items that come in 16 oz packaging; one of them (the KIWANO DURIAN cup) is from VENOMOUS BLAST.
library(dplyr)
library(lubridate)
innovation <- innovation %>%
mutate(
MONTH = month(ymd(DATE)), # Extract month using lubridate's ymd function
MONTH = as.factor(MONTH) # Convert the extracted month into a factor
)
str(innovation)
## 'data.frame': 8082 obs. of 13 variables:
## $ MARKET_KEY : chr "504" "953" "133" "817" ...
## $ DATE : chr "2022-02-26" "2022-08-20" "2020-12-19" "2022-02-05" ...
## $ CALORIC_SEGMENT: num 0 0 0 0 0 0 0 0 0 0 ...
## $ CATEGORY : chr "ENERGY" "ENERGY" "ENERGY" "ENERGY" ...
## $ UNIT_SALES : num 11 13 20 194 8 176 87 300 4 102 ...
## $ DOLLAR_SALES : num 78.9 21.8 40.5 287.1 63.4 ...
## $ MANUFACTURER : chr "PONYS" "JOLLYS" "JOLLYS" "JOLLYS" ...
## $ BRAND : chr "MYTHICAL BEVERAGE ULTRA" "SUPER-DUPER PURE ZERO" "HILL MOISTURE JUMPIN-FISH" "SUPER-DUPER PURE ZERO" ...
## $ PACKAGE : chr "16SMALL 4ONE CUP" "16SMALL MULTI CUP" "16SMALL MULTI CUP" "16SMALL MULTI CUP" ...
## $ ITEM : chr "MYTHICAL BEVERAGE ULTRA KIWANO ENERGY DRINK UNFLAVORED ZERO SUGAR CUP 16 LIQUID SMALL X4" "SUPER-DUPER PURE ZERO ENERGY DRINK KIWANO KEKE SUGAR FREE CUP 16 LIQUID SMALL" "RAINING JUMPIN-FISH GAME FUEL ZERO ENERGY DRINK CHARGED KIWANO SHOCK ZERO SUGAR CUP 16 LIQUID SMALL" "SUPER-DUPER PURE ZERO ENERGY DRINK KIWANO KEKE SUGAR FREE CUP 16 LIQUID SMALL" ...
## $ REGION : chr "NORTHERN" "ARIZONA" "MOUNTAIN" "COLORADO" ...
## $ MONTH : Factor w/ 12 levels "1","2","3","4",..: 2 8 12 2 5 5 10 8 5 8 ...
## $ SEASON : chr "WINTER" "SUMMER" "WINTER" "WINTER" ...
# Assuming 'innovation' is your data frame
model <- lm(DOLLAR_SALES ~ UNIT_SALES + CALORIC_SEGMENT + PACKAGE + SEASON + REGION, data = innovation)
summary(model)
##
## Call:
## lm(formula = DOLLAR_SALES ~ UNIT_SALES + CALORIC_SEGMENT + PACKAGE +
## SEASON + REGION, data = innovation)
##
## Residuals:
## Min 1Q Median 3Q Max
## -953.6 -35.4 -1.1 27.6 5847.8
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.571915 18.407162 -0.194 0.846141
## UNIT_SALES 2.179578 0.004342 501.957 < 2e-16 ***
## CALORIC_SEGMENT NA NA NA NA
## PACKAGE16SMALL 24ONE CUP 178.333322 19.021309 9.375 < 2e-16 ***
## PACKAGE16SMALL 4ONE CUP 62.481782 18.270952 3.420 0.000630 ***
## PACKAGE16SMALL MULTI CUP -9.985916 17.796875 -0.561 0.574742
## SEASONSPRING 7.749111 5.278459 1.468 0.142126
## SEASONSUMMER 0.158127 5.606383 0.028 0.977500
## SEASONWINTER -5.957836 5.296196 -1.125 0.260653
## REGIONCALI_NEVADA -6.656448 10.258577 -0.649 0.516443
## REGIONCOLORADO 19.756980 6.669432 2.962 0.003062 **
## REGIONDESERT_SW 0.165096 7.867662 0.021 0.983259
## REGIONKANSAS 170.758804 14.371366 11.882 < 2e-16 ***
## REGIONMOUNTAIN -0.897751 7.130829 -0.126 0.899816
## REGIONNEWMEXICO 15.665066 9.744594 1.608 0.107970
## REGIONNOCAL -18.993746 9.856390 -1.927 0.054009 .
## REGIONNORTHERN -6.707730 5.438244 -1.233 0.217449
## REGIONPRAIRIE 40.817170 11.563916 3.530 0.000418 ***
## REGIONSOCAL -14.067039 7.502472 -1.875 0.060831 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 164.2 on 8064 degrees of freedom
## Multiple R-squared: 0.975, Adjusted R-squared: 0.975
## F-statistic: 1.852e+04 on 17 and 8064 DF, p-value: < 2.2e-16
library(dplyr)
small_group <- df %>%
filter(UNIT_SALES < 3300, DOLLAR_SALES < 3200)
skim(small_group)
| Name | small_group |
| Number of rows | 2372840 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 9 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| MARKET_KEY | 0 | 1 | 1 | 4 | 0 | 200 | 0 |
| DATE | 0 | 1 | 10 | 10 | 0 | 152 | 0 |
| CATEGORY | 0 | 1 | 3 | 18 | 0 | 5 | 0 |
| MANUFACTURER | 0 | 1 | 5 | 8 | 0 | 8 | 0 |
| BRAND | 0 | 1 | 4 | 56 | 0 | 288 | 0 |
| PACKAGE | 0 | 1 | 11 | 26 | 0 | 95 | 0 |
| ITEM | 0 | 1 | 26 | 142 | 0 | 2999 | 0 |
| REGION | 0 | 1 | 5 | 11 | 0 | 11 | 0 |
| SEASON | 0 | 1 | 4 | 6 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| CALORIC_SEGMENT | 0 | 1 | 0.50 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▇ |
| UNIT_SALES | 0 | 1 | 104.83 | 183.49 | 0.04 | 10.00 | 38.00 | 113.00 | 3298.00 | ▇▁▁▁▁ |
| DOLLAR_SALES | 0 | 1 | 332.69 | 516.65 | 0.01 | 34.92 | 126.27 | 380.55 | 3199.98 | ▇▁▁▁▁ |
| MONTH | 0 | 1 | 6.28 | 3.44 | 1.00 | 3.00 | 6.00 | 9.00 | 12.00 | ▇▆▆▅▇ |
# Create a scatter plot with the regression line, colored by MANUFACTURER
ggplot(small_group, aes(x = UNIT_SALES, y = DOLLAR_SALES, color = MANUFACTURER)) +
geom_point(alpha = 0.5) + # Adjust alpha to avoid overplotting, if necessary
geom_smooth(method = "lm", color = "black", se = FALSE) + # Add linear regression line without confidence band for clarity
labs(title = "Linear Model of UNIT_SALES vs. DOLLAR_SALES by MANUFACTURER",
x = "UNIT SALES",
y = "DOLLAR SALES") +
theme_minimal() +
theme(legend.position = "bottom") # Adjust legend position if needed
## `geom_smooth()` using formula = 'y ~ x'
kiwano_small <- df[grep("kiwano", df$ITEM, ignore.case = TRUE), ]
skim(kiwano_small)
| Name | kiwano_small |
| Number of rows | 71256 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 9 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| MARKET_KEY | 0 | 1 | 1 | 4 | 0 | 200 | 0 |
| DATE | 0 | 1 | 10 | 10 | 0 | 152 | 0 |
| CATEGORY | 0 | 1 | 3 | 18 | 0 | 4 | 0 |
| MANUFACTURER | 0 | 1 | 5 | 8 | 0 | 7 | 0 |
| BRAND | 0 | 1 | 5 | 41 | 0 | 27 | 0 |
| PACKAGE | 0 | 1 | 12 | 23 | 0 | 28 | 0 |
| ITEM | 0 | 1 | 46 | 105 | 0 | 68 | 0 |
| REGION | 0 | 1 | 5 | 11 | 0 | 11 | 0 |
| SEASON | 0 | 1 | 4 | 6 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| CALORIC_SEGMENT | 0 | 1 | 0.34 | 0.48 | 0.00 | 0.00 | 0.0 | 1.00 | 1.00 | ▇▁▁▁▅ |
| UNIT_SALES | 0 | 1 | 101.93 | 384.04 | 0.50 | 8.00 | 26.0 | 76.00 | 16851.00 | ▇▁▁▁▁ |
| DOLLAR_SALES | 0 | 1 | 280.74 | 1016.57 | 0.01 | 28.25 | 86.8 | 221.68 | 45991.65 | ▇▁▁▁▁ |
| MONTH | 0 | 1 | 6.32 | 3.44 | 1.00 | 3.00 | 6.0 | 9.00 | 12.00 | ▇▆▆▅▇ |
# Assuming 'innovation' is your data frame
model <- lm(DOLLAR_SALES ~ UNIT_SALES + CALORIC_SEGMENT + PACKAGE + CATEGORY + SEASON + REGION, data = kiwano_small)
summary(model)
##
## Call:
## lm(formula = DOLLAR_SALES ~ UNIT_SALES + CALORIC_SEGMENT + PACKAGE +
## CATEGORY + SEASON + REGION, data = kiwano_small)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4689.1 -40.8 -5.9 38.3 6971.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.393e+02 1.103e+01 12.625 < 2e-16 ***
## UNIT_SALES 2.568e+00 1.992e-03 1288.714 < 2e-16 ***
## CALORIC_SEGMENT 1.117e+02 2.414e+00 46.279 < 2e-16 ***
## PACKAGE.5L 6ONE JUG -8.551e+01 6.049e+00 -14.137 < 2e-16 ***
## PACKAGE.5L MULTI JUG -1.233e+02 1.111e+01 -11.092 < 2e-16 ***
## PACKAGE1.25L MULTI JUG -3.811e+02 2.084e+01 -18.287 < 2e-16 ***
## PACKAGE12SMALL 12ONE CUP -8.777e+01 9.668e+00 -9.079 < 2e-16 ***
## PACKAGE12SMALL 24ONE CUP -2.348e+02 1.484e+01 -15.825 < 2e-16 ***
## PACKAGE12SMALL 8ONE CUP -8.347e+01 1.062e+01 -7.857 4.00e-15 ***
## PACKAGE12SMALL MLT BUMPY CUP -1.712e+02 7.514e+00 -22.790 < 2e-16 ***
## PACKAGE12SMALL MLT MEDIUM CUP -1.833e+02 4.247e+01 -4.317 1.58e-05 ***
## PACKAGE12SMALL MULTI CUP -1.020e+02 1.107e+01 -9.209 < 2e-16 ***
## PACKAGE16SMALL 12ONE CUP -1.417e+02 2.247e+01 -6.308 2.85e-10 ***
## PACKAGE16SMALL 24ONE CUP 3.268e+01 1.338e+01 2.442 0.014617 *
## PACKAGE16SMALL 4ONE CUP -8.254e+01 1.206e+01 -6.842 7.86e-12 ***
## PACKAGE16SMALL MULTI CUP -2.411e+02 1.079e+01 -22.335 < 2e-16 ***
## PACKAGE18SMALL 6ONE -3.718e+01 5.306e+00 -7.007 2.46e-12 ***
## PACKAGE18SMALL MULTI JUG -1.642e+02 4.463e+00 -36.784 < 2e-16 ***
## PACKAGE1L MULTI JUG -1.883e+02 2.070e+01 -9.096 < 2e-16 ***
## PACKAGE20SMALL 12ONE JUG -2.342e+02 6.947e+01 -3.371 0.000749 ***
## PACKAGE20SMALL MULTI JUG -2.175e+02 4.431e+00 -49.091 < 2e-16 ***
## PACKAGE24 - 25SMALL MULTI JUG -1.819e+02 5.587e+00 -32.557 < 2e-16 ***
## PACKAGE24SMALL MLT SHADYES JUG -1.985e+02 4.364e+01 -4.548 5.42e-06 ***
## PACKAGE2L MULTI JUG -2.168e+02 7.749e+00 -27.981 < 2e-16 ***
## PACKAGE7.5SMALL 10ONE -1.313e+02 1.064e+02 -1.234 0.217097
## PACKAGE8SMALL 12ONE CUP -3.811e+00 1.199e+01 -0.318 0.750544
## PACKAGE8SMALL 24ONE CUP -2.551e+02 2.122e+01 -12.022 < 2e-16 ***
## PACKAGE8SMALL 4ONE CUP -1.213e+02 1.151e+01 -10.540 < 2e-16 ***
## PACKAGE8SMALL MULTI CUP -3.477e+02 1.148e+01 -30.292 < 2e-16 ***
## PACKAGEALL OTHER ONES -2.758e+01 1.107e+01 -2.490 0.012769 *
## CATEGORYING ENHANCED WATER -1.694e+01 1.022e+01 -1.658 0.097326 .
## CATEGORYSPARKLING WATER -3.909e+00 3.442e+00 -1.136 0.256005
## CATEGORYSSD -6.247e+01 9.538e+00 -6.550 5.80e-11 ***
## SEASONSPRING -1.027e+01 1.944e+00 -5.282 1.28e-07 ***
## SEASONSUMMER -1.143e+01 2.017e+00 -5.667 1.46e-08 ***
## SEASONWINTER -7.458e+00 1.976e+00 -3.774 0.000161 ***
## REGIONCALI_NEVADA 4.919e+00 4.000e+00 1.230 0.218800
## REGIONCOLORADO 1.046e+01 2.480e+00 4.220 2.45e-05 ***
## REGIONDESERT_SW 2.392e+00 2.934e+00 0.815 0.414922
## REGIONKANSAS 1.757e+02 5.232e+00 33.571 < 2e-16 ***
## REGIONMOUNTAIN 1.032e+01 2.691e+00 3.835 0.000126 ***
## REGIONNEWMEXICO 1.735e+01 3.511e+00 4.943 7.71e-07 ***
## REGIONNOCAL 2.368e+00 3.761e+00 0.630 0.528893
## REGIONNORTHERN 8.657e+00 2.026e+00 4.273 1.93e-05 ***
## REGIONPRAIRIE 2.289e+01 4.318e+00 5.301 1.15e-07 ***
## REGIONSOCAL 1.705e+00 2.842e+00 0.600 0.548662
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 183.4 on 71210 degrees of freedom
## Multiple R-squared: 0.9675, Adjusted R-squared: 0.9675
## F-statistic: 4.707e+04 on 45 and 71210 DF, p-value: < 2.2e-16
Though the Kiwano energy-drink subset has relatively few rows, I think there is potential to find a well-fitting model that predicts launch sales. If we can build a model that predicts unit sales of 16 oz Kiwano-flavored energy drinks, we can combine it with the current sales rate of VENOMOUS BLAST launches to produce an accurate forecast. As for selecting the best weeks to sell, I see no better approach than using the historical best 13 weeks of sales for VENOMOUS BLAST, energy drinks, or Kiwano-flavored drinks.
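The two-step forecast described above can be sketched as a simple share-based allocation: predict weekly units for the segment, then scale by the brand's historical share of that segment. All names here (`segment_weekly_pred`, `brand_units`, `segment_units`) are illustrative, not columns from the dataset.

```r
# Allocate a segment-level weekly forecast to one brand using its
# historical share of the segment's unit sales.
allocate_brand_forecast <- function(segment_weekly_pred, brand_units, segment_units) {
  share <- brand_units / segment_units  # brand's historical share of the segment
  segment_weekly_pred * share           # brand-level weekly forecast
}
```

For VENOMOUS BLAST, `brand_units` could come from the brand_summary table above (360,173 total units) and `segment_units` from the kiwano / energy / 16 oz subset.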
# Load and prepare dataset
df1 <- read.csv("one_hot_kiwano.csv")
df1 <- df1 %>%
select(-DATE, -MONTH, -WINTER, -SPRING, -FALL, -DOLLAR_SALES, -SUMMER)
# Summarize the dataset
skimr::skim(df1)
| Name | df1 |
| Number of rows | 8082 |
| Number of columns | 31 |
| _______________________ | |
| Column type frequency: | |
| numeric | 31 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| UNIT_SALES | 0 | 1 | 171.20 | 468.61 | 0.5 | 10 | 66 | 214 | 10621 | ▇▁▁▁▁ |
| NORTHERN | 0 | 1 | 0.25 | 0.43 | 0.0 | 0 | 0 | 0 | 1 | ▇▁▁▁▂ |
| ARIZONA | 0 | 1 | 0.21 | 0.41 | 0.0 | 0 | 0 | 0 | 1 | ▇▁▁▁▂ |
| MOUNTAIN | 0 | 1 | 0.10 | 0.29 | 0.0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁ |
| COLORADO | 0 | 1 | 0.12 | 0.32 | 0.0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁ |
| DESERT_SW | 0 | 1 | 0.07 | 0.26 | 0.0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁ |
| NOCAL | 0 | 1 | 0.04 | 0.20 | 0.0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁ |
| SOCAL | 0 | 1 | 0.09 | 0.28 | 0.0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁ |
| KANSAS | 0 | 1 | 0.02 | 0.14 | 0.0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁ |
| NEWMEXICO | 0 | 1 | 0.04 | 0.20 | 0.0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁ |
| CALI_NEVADA | 0 | 1 | 0.04 | 0.19 | 0.0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁ |
| PRAIRIE | 0 | 1 | 0.03 | 0.17 | 0.0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁ |
| MYTHICAL.BEVERAGE.ULTRA | 0 | 1 | 0.55 | 0.50 | 0.0 | 0 | 1 | 1 | 1 | ▆▁▁▁▇ |
| SUPER.DUPER.PURE.ZERO | 0 | 1 | 0.37 | 0.48 | 0.0 | 0 | 0 | 1 | 1 | ▇▁▁▁▅ |
| HILL.MOISTURE.JUMPIN.FISH | 0 | 1 | 0.04 | 0.19 | 0.0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁ |
| VENOMOUS.BLAST | 0 | 1 | 0.03 | 0.17 | 0.0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁ |
| POW.POW | 0 | 1 | 0.01 | 0.10 | 0.0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁ |
| MYTHICAL.BEVERAGE.REHAB | 0 | 1 | 0.00 | 0.02 | 0.0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁ |
| MYTHICAL.BEVERAGE.ULTRA.KIWANO.UNFLAVORED. | 0 | 1 | 0.55 | 0.50 | 0.0 | 0 | 1 | 1 | 1 | ▆▁▁▁▇ |
| SUPER.DUPER.PURE.ZERO.KIWANO.KEKE. | 0 | 1 | 0.30 | 0.46 | 0.0 | 0 | 0 | 1 | 1 | ▇▁▁▁▃ |
| RAINING.JUMPIN.FISH.GAME.FUEL.ZERO.CHARGED.KIWANO.SHOCK. | 0 | 1 | 0.04 | 0.19 | 0.0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁ |
| SUPER.DUPER.PURE.ZERO.KIWANO. | 0 | 1 | 0.37 | 0.48 | 0.0 | 0 | 0 | 1 | 1 | ▇▁▁▁▅ |
| VENOMOUS.BLAST.KIWANO.DURIAN. | 0 | 1 | 0.03 | 0.17 | 0.0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁ |
| POW.POW.WYLDIN.KIWANO. | 0 | 1 | 0.01 | 0.10 | 0.0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁ |
| MYTHICAL.BEVERAGE.REHAB.KIWANO. | 0 | 1 | 0.00 | 0.02 | 0.0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁ |
| X16SMALL.4ONE.CUP | 0 | 1 | 0.15 | 0.36 | 0.0 | 0 | 0 | 0 | 1 | ▇▁▁▁▂ |
| X16SMALL.MULTI.CUP | 0 | 1 | 0.77 | 0.42 | 0.0 | 1 | 1 | 1 | 1 | ▂▁▁▁▇ |
| X16SMALL.24ONE.CUP | 0 | 1 | 0.07 | 0.25 | 0.0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁ |
| X16SMALL.12ONE.CUP | 0 | 1 | 0.01 | 0.10 | 0.0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁ |
| Week_Of_Year | 0 | 1 | 25.13 | 15.37 | 1.0 | 12 | 24 | 39 | 53 | ▇▇▆▆▆ |
| Week_Since_Launch | 0 | 1 | 65.58 | 40.11 | 1.0 | 29 | 66 | 99 | 152 | ▇▆▇▆▂ |
One-hot encoding the sampled data yields just over 8,000 rows and 31 columns.
#Remove outliers in top 1% of Unit Sales.
df1 <- df1 %>% filter(UNIT_SALES < quantile(UNIT_SALES, 0.99))
# Split the data
set.seed(123)
df_testtrn <- initial_split(df1, prop = 0.8, strata = UNIT_SALES)
Train <- training(df_testtrn)
Test <- testing(df_testtrn)
# Prepare features and labels for XGBoost
train_features <- Train[, -which(names(Train) == "UNIT_SALES")]
train_labels <- Train$UNIT_SALES
test_features <- Test[, -which(names(Test) == "UNIT_SALES")]
test_labels <- Test$UNIT_SALES
# Convert data to DMatrix format
dtrain <- xgb.DMatrix(data = as.matrix(train_features), label = train_labels)
dtest <- xgb.DMatrix(data = as.matrix(test_features), label = test_labels)
# Define XGBoost parameters
set.seed(123)
params <- list(
booster = "gbtree",
objective = "reg:squarederror",
eval_metric = "rmse",
eta = 0.05,
max_depth = 4,
min_child_weight = 3,
subsample = 0.7,
colsample_bytree = 0.6,
lambda = 1,
alpha = 1
)
# Perform cross-validation to find the optimal number of boosting rounds
cv_results <- xgb.cv(
params = params,
data = dtrain,
nfold = 5,
nrounds = 500, # Changed from 'num_boost_round' to 'nrounds'
early_stopping_rounds = 10,
metrics = "rmse",
seed = 123
)
## [1] train-rmse:217.705776+1.726588 test-rmse:217.646442+6.056629
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
##
## [2] train-rmse:210.405385+1.852384 test-rmse:210.425573+6.148944
## [3] train-rmse:203.574471+1.951320 test-rmse:203.611020+5.928709
## [4] train-rmse:196.898200+1.718560 test-rmse:196.906546+6.172606
## [5] train-rmse:190.845352+1.592508 test-rmse:190.903124+6.130116
## [6] train-rmse:184.874631+1.561967 test-rmse:184.974468+6.172219
## [7] train-rmse:179.604110+1.212913 test-rmse:179.705730+6.458494
## [8] train-rmse:174.356396+1.188794 test-rmse:174.468449+6.500571
## [9] train-rmse:169.601819+0.902011 test-rmse:169.747975+6.774740
## [10] train-rmse:165.223534+1.017305 test-rmse:165.398048+6.702410
## [11] train-rmse:161.041804+0.800571 test-rmse:161.187727+6.906616
## [12] train-rmse:157.075138+0.741441 test-rmse:157.214980+6.924278
## [13] train-rmse:153.672930+0.970396 test-rmse:153.905586+6.851054
## [14] train-rmse:150.488336+1.236060 test-rmse:150.717887+6.443776
## [15] train-rmse:147.429706+1.486844 test-rmse:147.678809+6.255446
## [16] train-rmse:144.559867+1.666008 test-rmse:144.814005+6.056320
## [17] train-rmse:141.878498+1.542848 test-rmse:142.137111+6.138557
## [18] train-rmse:139.568967+1.705662 test-rmse:139.851989+5.967131
## [19] train-rmse:137.192000+1.639567 test-rmse:137.577392+6.046113
## [20] train-rmse:135.116385+1.691470 test-rmse:135.514617+5.985587
## [21] train-rmse:133.312662+1.609342 test-rmse:133.749912+6.113063
## [22] train-rmse:131.748143+1.331677 test-rmse:132.238788+6.382739
## [23] train-rmse:130.185531+1.383939 test-rmse:130.667299+6.364131
## [24] train-rmse:128.612633+1.166141 test-rmse:129.135985+6.517994
## [25] train-rmse:126.906077+1.097951 test-rmse:127.467588+6.515756
## [26] train-rmse:125.594452+1.040191 test-rmse:126.168142+6.481015
## [27] train-rmse:124.185118+0.992398 test-rmse:124.745337+6.427232
## [28] train-rmse:123.005509+1.058536 test-rmse:123.608229+6.286645
## [29] train-rmse:122.040433+1.143680 test-rmse:122.627676+6.172173
## [30] train-rmse:121.084636+1.184286 test-rmse:121.701602+6.109215
## [31] train-rmse:120.010516+1.146676 test-rmse:120.644412+6.075442
## [32] train-rmse:119.138369+1.041651 test-rmse:119.811827+6.130964
## [33] train-rmse:118.270005+1.038676 test-rmse:118.975096+6.114711
## [34] train-rmse:117.406429+1.042121 test-rmse:118.130063+6.062164
## [35] train-rmse:116.671483+1.066339 test-rmse:117.424609+5.991489
## [36] train-rmse:116.038411+1.108969 test-rmse:116.847059+5.938512
## [37] train-rmse:115.411318+1.189561 test-rmse:116.262070+5.836692
## [38] train-rmse:114.769714+1.182860 test-rmse:115.650483+5.787148
## [39] train-rmse:114.233882+1.213768 test-rmse:115.144265+5.735848
## [40] train-rmse:113.711103+1.180795 test-rmse:114.665068+5.717172
## [41] train-rmse:113.237289+1.181022 test-rmse:114.224130+5.682975
## [42] train-rmse:112.786174+1.164327 test-rmse:113.806499+5.647998
## [43] train-rmse:112.358566+1.215843 test-rmse:113.381558+5.556097
## [44] train-rmse:111.908613+1.207228 test-rmse:112.990934+5.509302
## [45] train-rmse:111.505880+1.227312 test-rmse:112.609532+5.480405
## [46] train-rmse:111.159853+1.227209 test-rmse:112.269001+5.449478
## [47] train-rmse:110.837257+1.264560 test-rmse:111.953361+5.369060
## [48] train-rmse:110.496226+1.228425 test-rmse:111.648858+5.336585
## [49] train-rmse:110.207935+1.247835 test-rmse:111.388688+5.275763
## [50] train-rmse:109.896859+1.230994 test-rmse:111.095440+5.250226
## [51] train-rmse:109.630616+1.209032 test-rmse:110.873257+5.226998
## [52] train-rmse:109.395453+1.171001 test-rmse:110.681952+5.240345
## [53] train-rmse:109.178581+1.170910 test-rmse:110.500511+5.215448
## [54] train-rmse:108.959216+1.204725 test-rmse:110.270525+5.124835
## [55] train-rmse:108.736562+1.189279 test-rmse:110.089529+5.092821
## [56] train-rmse:108.520950+1.191878 test-rmse:109.892245+5.050693
## [57] train-rmse:108.303594+1.197308 test-rmse:109.679085+4.974027
## [58] train-rmse:108.123674+1.204976 test-rmse:109.524332+4.918725
## [59] train-rmse:107.949782+1.195668 test-rmse:109.372370+4.918172
## [60] train-rmse:107.814083+1.201812 test-rmse:109.260263+4.914265
## [61] train-rmse:107.659414+1.179929 test-rmse:109.130108+4.895703
## [62] train-rmse:107.488704+1.185633 test-rmse:108.996983+4.862993
## [63] train-rmse:107.354358+1.166640 test-rmse:108.886217+4.836369
## [64] train-rmse:107.201045+1.154677 test-rmse:108.764455+4.809095
## [65] train-rmse:107.067648+1.166265 test-rmse:108.671354+4.796509
## [66] train-rmse:106.926545+1.171891 test-rmse:108.536175+4.749954
## [67] train-rmse:106.833710+1.169548 test-rmse:108.454146+4.742443
## [68] train-rmse:106.706328+1.163537 test-rmse:108.348051+4.742568
## [69] train-rmse:106.583383+1.157084 test-rmse:108.252509+4.702418
## [70] train-rmse:106.450339+1.157228 test-rmse:108.159475+4.713849
## [71] train-rmse:106.315701+1.156993 test-rmse:108.063895+4.699920
## [72] train-rmse:106.194928+1.151865 test-rmse:108.002515+4.657357
## [73] train-rmse:106.089994+1.126558 test-rmse:107.910730+4.674931
## [74] train-rmse:105.985273+1.117598 test-rmse:107.833325+4.671763
## [75] train-rmse:105.872339+1.126851 test-rmse:107.722153+4.639997
## [76] train-rmse:105.768401+1.150273 test-rmse:107.632913+4.610496
## [77] train-rmse:105.684353+1.142778 test-rmse:107.559957+4.621758
## [78] train-rmse:105.606365+1.132159 test-rmse:107.485752+4.635423
## [79] train-rmse:105.528355+1.137745 test-rmse:107.431489+4.613220
## [80] train-rmse:105.439508+1.130636 test-rmse:107.371627+4.596848
## [81] train-rmse:105.349802+1.114822 test-rmse:107.302164+4.583688
## [82] train-rmse:105.242874+1.126858 test-rmse:107.224359+4.578922
## [83] train-rmse:105.144347+1.113740 test-rmse:107.145975+4.598302
## ... (iterations 84-224 omitted; train and test RMSE decline steadily) ...
## [225] train-rmse:99.608931+0.987994 test-rmse:104.947984+3.809749
## [226] train-rmse:99.594940+0.993725 test-rmse:104.932484+3.812721
## ... (iterations 227-235 omitted) ...
## [236] train-rmse:99.398246+0.967131 test-rmse:104.948200+3.811817
## Stopping. Best iteration:
## [226] train-rmse:99.594940+0.993725 test-rmse:104.932484+3.812721
best_nrounds <- cv_results$best_iteration
# Train the final model using the best number of rounds found
model_xgb <- xgb.train(
params = params,
data = dtrain,
nrounds = best_nrounds
)
# Make predictions and evaluate the model
train_pred <- predict(model_xgb, dtrain)
test_pred <- predict(model_xgb, dtest)
train_rmse <- sqrt(mean((train_labels - train_pred)^2))
test_rmse <- sqrt(mean((test_labels - test_pred)^2))
# Calculate R-squared for the training set
sst_train <- sum((train_labels - mean(train_labels)) ^ 2)
ssr_train <- sum((train_labels - train_pred) ^ 2)
r_squared_train <- 1 - (ssr_train / sst_train)
# Calculate R-squared for the test set
sst_test <- sum((test_labels - mean(test_labels)) ^ 2)
ssr_test <- sum((test_labels - test_pred) ^ 2)
r_squared_test <- 1 - (ssr_test / sst_test)
train_mape <- mean(abs((train_labels - train_pred) / train_labels)) * 100
test_mape <- mean(abs((test_labels - test_pred) / test_labels)) * 100
train_mae <- mean(abs(train_labels - train_pred))
test_mae <- mean(abs(test_labels - test_pred))
cat("Model Performance Metrics:\n",
"--------------------------\n",
"Training RMSE: ", train_rmse, "\n",
"Test RMSE: ", test_rmse, "\n",
"Training R-squared: ", r_squared_train, "\n",
"Test R-squared: ", r_squared_test, "\n",
"Training MAE: ", train_mae, "\n",
"Test MAE: ", test_mae, "\n",
"Training MAPE: ", train_mape, "%\n",
"Test MAPE: ", test_mape, "%\n", sep="")
## Model Performance Metrics:
## --------------------------
## Training RMSE: 100.2079
## Test RMSE: 109.262
## Training R-squared: 0.6840278
## Test R-squared: 0.6283978
## Training MAE: 58.94409
## Test MAE: 62.40066
## Training MAPE: 231.2208%
## Test MAPE: 223.7863%
For the Kiwano Energy model, the training RMSE is 100.21 and the test RMSE is 109.26. Some increase from train to test is expected, but the size of this gap suggests we should check for slight overfitting. The R-squared values are moderate, 0.68 on training and 0.63 on test, indicating the model explains some, but not all, of the variance. MAE is also reasonably low, with no significant difference between training (58.9) and test (62.4). The last metric, MAPE, is very high at roughly 231% for training and 224% for test, meaning our percentage errors are large, driven in part by weeks with very low actual sales inflating the ratio. Overall this model does show some predictive power, and with more features we may be able to get stronger predictive power.
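The triple-digit MAPE is easier to interpret with a toy calculation: because MAPE divides each error by the actual value, weeks with very small actual sales can dominate the average even when the absolute misses are modest. A minimal sketch (in Python rather than R, with invented numbers, purely to show the arithmetic):

```python
# Invented example: three weeks of actual vs. predicted sales.
actuals = [2.0, 50.0, 100.0]
preds = [10.0, 55.0, 95.0]

# MAE averages the absolute errors: (8 + 5 + 5) / 3 = 6.0
mae = sum(abs(a - p) for a, p in zip(actuals, preds)) / len(actuals)

# MAPE divides each error by the actual before averaging.
# The low-sales week alone contributes |2 - 10| / 2 = 400%.
mape = 100 * sum(abs(a - p) / a for a, p in zip(actuals, preds)) / len(actuals)

print(mae)   # 6.0
print(mape)  # ~138.3
```

So a model with a small MAE can still report a MAPE well above 100%, which is consistent with the pattern in the metrics above.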
# Calculate feature importance
importance_matrix2 <- xgb.importance(feature_names = colnames(train_features), model = model_xgb)
# View the feature importance scores
print(importance_matrix2)
##     Feature                                                    Gain         Cover        Frequency
##  1: MYTHICAL.BEVERAGE.ULTRA                                    1.976594e-01 5.625228e-02 0.0738948048
##  2: X16SMALL.MULTI.CUP                                         1.613989e-01 7.601545e-02 0.0580832527
##  3: MYTHICAL.BEVERAGE.ULTRA.KIWANO.UNFLAVORED.                 1.477780e-01 3.900313e-02 0.0319457890
##  4: X16SMALL.4ONE.CUP                                          8.659552e-02 1.227161e-02 0.0293643111
##  5: Week_Since_Launch                                          8.278842e-02 2.398665e-01 0.2362052275
##  6: X16SMALL.24ONE.CUP                                         8.058614e-02 1.117101e-02 0.0183930300
##  7: SUPER.DUPER.PURE.ZERO                                      4.150738e-02 1.440127e-02 0.0225879316
##  8: PRAIRIE                                                    3.174524e-02 4.779081e-02 0.0319457890
##  9: COLORADO                                                   2.806605e-02 3.974239e-02 0.0325911584
## 10: SUPER.DUPER.PURE.ZERO.KIWANO.KEKE.                         2.458380e-02 3.073693e-02 0.0393675379
## 11: KANSAS                                                     1.890134e-02 5.117822e-02 0.0454985479
## 12: NOCAL                                                      1.401292e-02 3.887320e-02 0.0238786705
## 13: HILL.MOISTURE.JUMPIN.FISH                                  1.366497e-02 2.396210e-03 0.0054856405
## 14: Week_Of_Year                                               1.297703e-02 1.123534e-01 0.1613423685
## 15: NORTHERN                                                   8.778455e-03 2.704485e-02 0.0325911584
## 16: VENOMOUS.BLAST                                             8.147177e-03 9.647209e-03 0.0093578574
## 17: SUPER.DUPER.PURE.ZERO.KIWANO.                              7.398521e-03 3.921251e-03 0.0054856405
## 18: ARIZONA                                                    6.004959e-03 2.087564e-02 0.0242013553
## 19: CALI_NEVADA                                                5.905601e-03 4.056876e-02 0.0190383995
## 20: SOCAL                                                      5.748893e-03 3.233374e-02 0.0209745079
## 21: DESERT_SW                                                  3.905222e-03 3.179025e-02 0.0190383995
## 22: RAINING.JUMPIN.FISH.GAME.FUEL.ZERO.CHARGED.KIWANO.SHOCK.   3.858922e-03 1.894793e-03 0.0032268474
## 23: MOUNTAIN                                                   3.597554e-03 3.500368e-02 0.0245240400
## 24: NEWMEXICO                                                  2.997507e-03 2.208191e-02 0.0200064537
## 25: VENOMOUS.BLAST.KIWANO.DURIAN.                              1.372027e-03 2.086846e-03 0.0048402711
## 26: POW.POW                                                    1.096999e-05 3.756917e-04 0.0038722168
## 27: POW.POW.WYLDIN.KIWANO.                                     7.459850e-06 3.128289e-04 0.0019361084
## 28: X16SMALL.12ONE.CUP                                         1.628477e-06 1.014714e-05 0.0003226847
xgb.plot.importance(importance_matrix = importance_matrix2)
From this importance matrix we see that brand and size are the two biggest contributors to our model by Gain. We also see that the engineered feature Week_Since_Launch plays a large part in generating predictions, ranking first in both Cover and Frequency.
Our next steps, in preparation for presentation to the client, include revisiting a number of critical variables and insights we found along the way through this model submission. This includes deliberate forecasting and time-series modeling on real datasets that represent products that any new innovation would hope to pursue in the market, with a particular focus on demand from a prescriptive-analytics perspective. There are many tasks we could pursue; we will prioritize those most relevant to the client's questions.
All team members contributed equally to the creation of this notebook. Each member took a detailed look at each of the questions posed by the client, Swire, and we debated which approaches presented an evident way forward. Our team is aware that this project is barely half over, and the submission of this notebook constitutes the basis on which we will substantiate claims made to the client.